Preface



This year’s International Workshop on Networked Group Communications (NGC) 
was the third event in the NGC series following successful workshops in Pisa, 
Italy (1999), and Stanford, CA, USA (2000). 

For the technical program this year, we received 40 submissions from both 
academia and industrial institutions all around the world. For each paper, we 
gathered at least three (and sometimes as many as five) reviews. After a thorough 
PC meeting that was held in London, 14 papers were selected for presentation 
and publication in the workshop proceedings. The program continues the themes 
of previous years, ranging from applications, through security, down to group 
management, topological considerations, and performance. 

The program committee worked hard to make sure that the papers were not 
only thoroughly reviewed, but also that the final versions were as up to date and 
as accurate as could be managed in the usual circumstances. 

This year, the workshop was held in the London Zoo, hosted by University 
College London (the tutorials having been held at UCL the day before the start 
of the paper sessions). The proceedings are published by Springer-Verlag in the
LNCS series. 

There are signs that the scope of the NGC workshops is broadening to include 
hot topics such as group communication and content networking. We expect to 
see more submissions in these areas, as well as other new topics, for NGC 2002, 
which is to be held at the University of Massachusetts at Amherst, chaired by
Brian Levine and Mostafa Ammar. 

We hope that the research community will find these proceedings interesting 
and helpful and that NGC will continue to be an active forum for research on 
networked group communication and related areas for years to come. 
Abstract. Multiplayer online games represent one of the most popular
forms of networked group communication on the Internet today. We
have been running a server for a first-person shooter game, Half-Life. In 
this paper we analyse some of the delay characteristics of different play- 
ers on the server and present some interim results. We find that whilst 
network delay has some effect on players’ behaviour, this is outweighed 
by application-level or exogenous effects. Players seem to be remarkably 
tolerant of network conditions, and absolute delay bounds appear to be 
less important than the relative delay between players. 



1 Introduction 

Multiplayer online games represent one of the most popular forms of networked
group communication on the Internet today, and they contribute to an
increasingly large proportion of network traffic [11]. There has been little work
to analyse or characterise these applications, or to determine any specific user
or network requirements. The real-time nature of many of these games means that
response times are important, and in a networked environment this means that
round-trip delays must be kept to a minimum. Is network delay, however, the
most important factor in a user’s gaming experience? In this paper we examine
the relationship between application-level delay and player behaviour in
multiplayer networked games. The main question that we wished to answer was “How
important is delay in a player’s decision to select and stay on a particular games
server?”. To answer this, we have been running a publicly-accessible server for
one of the more popular FPS (First Person Shooter) games, Half-Life [13]. From
this server, we have logged and analysed usage behaviour at both the application
and network level. The paper is structured as follows. In Section 2 we look at
previous work and discuss some expectations we had prior to this study. Section 3
describes our server setup and data collection methodology. Section 4 describes
the results that we observed, and Section 5 outlines directions for further work.



* The author is funded by a Hewlett-Packard EPSRC CASE award. 
2 Background



In this section we discuss previous work on multiplayer games, and what this 
led us to expect before commencing this study. 

Previous analysis of popular commercial networked games has thus far
concentrated on observing local area network traffic and behaviour [2, 4], and on
network-level rather than user- and session-level characteristics. There is also
little empirical analysis of the delay requirements for real-time multiplayer
applications. However, it is generally accepted that low latencies are a requirement.
Cheshire [5] proposes 100ms as a suitable bound, whilst human factors research
indicates that 200ms might be a more appropriate limit [1]. The IEEE DIS
(Distributed Interactive Simulation) standard stipulates a latency bound of between
100ms and 300ms for military simulations [8]. MacKenzie and Ware find that in
a VR (Virtual Reality) environment, interaction becomes very difficult above a
delay of 225ms [10]. Such previous work implies that there should be an
absolute bound to player delay, beyond which players’ gameplay becomes so impaired
that they would either abort the game or find another server.

MiMaze was a multiplayer game which ran over the multicast backbone 
(MBone). Some statistics related to a network session of the game are presented
in [6]. Using a sample of 25 players, the authors find that the average client delay is
55ms, and as such the state synchronisation mechanisms are designed with a
maximum delay of 100ms in mind. The limited nature of the MBone, however, 
means that the measurements taken might not be representative of games today. 
The average delay is far lower than what we observe, and might result from the 
fact that all clients were located within the same geographic region (France) and 
well-connected to each other via the MBone. 

Like MiMaze, Half-Life uses a client-server architecture, but the distribution 
model is unicast. Players connect to a common server, which maintains state 
about the nature of the game. The objective of the game is to shoot and kill 
as many of the other players as possible. This setup is representative of most of
the popular FPS games. Games are typically small, between 16 and 32 players,
and there are several thousand servers located all over the Internet. Table 1
shows the average number of servers for some of the more popular games. We
obtained these figures by querying master servers for these games every 12 hours 
for two months. Since there are lots of small groups on servers located all over 
the Internet, it is reasonable to assume that players will attempt to connect to 
one of the servers with the lowest delay. This implies that most players would 
come from geographic locations near to our server, assuming that these have 
lower delays. 

Table 1. Average Number of Servers for Different FPS Games.

Game                 Average number of servers
Half-Life                             15290.52
Unreal Tournament                      2930.16
Quake III Arena                        2217.93
Quake II                               1207.43
Tribes 2                                968.67
QuakeWorld                              389.28
Quake                                    84.94
Sin                                      49.05
Tribes                                   42.51
Kingpin                                  42.20
Heretic II                                9.12

The nature of FPS games means that delay should be an important factor
in the gaming experience. Players run around a large “map” (the virtual world)
picking up weapons and firing at each other. Low response times should therefore
be an advantage, since players can then respond more successfully to other users.
If, however, the main benefit of low delay is to gain an advantage over other
players, then we might expect that absolute delay is not so important as the
variance of delay, since a player needs only to have a relatively low delay compared
to the other players in the game.

Usability analysis of non-networked computer games, e.g. [9], indicates that
many actions become routine for players as they become expert at the tasks 
involved, but whether this still holds when the opponents are less predictable 
(i.e., other humans) is unclear. Unfortunately, we know of no usability studies 
that specifically analyse networked games, but perhaps regular and highly-skilled 
players might be able to better tolerate delay, as they become more adept at the 
specific details and strategy of a particular game, such as Half-Life. 

3 Methodology 

We recorded users that connected to a games server that we set up at University
College London (UCL) in the UK. This server comprised a 900MHz AMD Athlon
PC with 256 MB of RAM, running Linux kernel version 2.4.2, and was connected
to our departmental network via 100BaseT Ethernet. To prevent the possibility 
of users being prejudiced by connecting to an academic site, we registered a 
non-geographic .com domain name to use instead of a cs.ucl.ac.uk address. The 
server was advertised to potential players only by using the game’s standard 
mechanisms, whereby a server registers with a “master server”. These master
servers exist to provide lists of game servers for players; when a player wishes to 
play a game, they either connect to a known IP address (obtained through out- 
of-band information or from previous games), or they query the master server 
to find a suitable game server. 

The game was set to rotate maps every 60 minutes, so as to keep the game 
interesting for existing and potential players. In addition, players were permitted 
to vote for the next map or to extend the current map at each map rotation 
interval. The number of players permitted to access the server was arbitrarily 
set to 24; although the game can potentially support a much higher number of 
players, most of the more popular maps only effectively scale to 24 players due 
to a lack of “spawn points” (locations where players can enter the map). There
were no specific game-based sessions or goals imposed; players were free to join 
and leave the server at any time. 

Player behaviour was monitored at both the application and the network 
level. For application-level logging, we took advantage of the server daemon’s 
built-in logging facilities, and augmented this with an additional third-party 
server management tool to provide more comprehensive logs. Packet-level mon- 
itoring used tcpdump, which was set to log UDP packet headers only. 

The data that is analysed here derives from running the server between 21 
March 2001 18:33 GMT and 15 April 2001 08:28 BST. In this time we observed 
31941 sessions (a single user joining and leaving the server). 

3.1 Determining Unique Users 

Many of the issues that we examine in Section 4 require knowledge of which
sessions correspond to which particular users, for example, examining the average
delay across all of a particular player’s sessions. Such persistent user/session
relationships cannot be determined by network-level traces alone, and session- 
level data is required. However, the nature of most FPS games, where any user 
can connect to any appropriate server with a minimal amount of authentication, 
means that determining which sessions belong to which users can be difficult. 

Connecting to a Half-Life server is a two-stage process. The client first
authenticates with the so-called “WON Auth Server” (the acronym WON stands
for World Opponent Network, the organisation that runs the gaming website
http://www.won.net). The authentication server issues the player with a
“WONID”, a unique identifier generated using the player’s license key, which is
provided with the CD-ROM media when a player purchases the Half-Life
software. There is thus one unique WONID for each purchased copy of the game.
Once a WONID has been generated, the player can connect to the Half-Life
server of their choice.

Unfortunately, using the WONIDs as a means of identifying unique players 
proved insufficient. We observed a large number of duplicate WONIDs, indicated 
by simultaneous use of the same WONID, or players with the same WONID 
connecting from highly geographically dispersed locations. This duplication of 
WONIDs is probably due to the sharing of license keys or the use of pirate 
copies of the game, and so the same WONID can represent more than one 
user. This situation is exacerbated because the game server program does not 
reject multiple users with the same WONID from playing simultaneously (this 
occurred 493 times during the period of this study). In addition, on two occasions
the WON Authentication Server seemed to malfunction, issuing all users with a 
WONID of 0. Although it would have been possible to modify the server to reject 
simultaneous duplicate WONIDs, this would not resolve the problem of different 
players connecting at different times with the same WONID, and so we needed 
to try and determine which sessions belonged to which different individuals. 

The identifying information logged by the server for each player is the player’s 
WONID, their IP address and port number, and the nickname that they choose 
to use in the game. Of the 14776 total WONIDs that we observed, 11612 had
unique (WONID, nickname, IP address, port) tuples; we can be reasonably sure
that each of these represents one unique user. For the remaining WONIDs, we had
to make some assumptions about users in order to determine their uniqueness.
We assume that a unique (WONID, nickname) tuple across all sessions is a
single user, who probably has a dynamically-assigned IP address or multiple access
ISPs. Users tend to change their names quite often, both between and during
sessions, and by looking at all the names used by a particular WONID and
looking for common names it was possible to further isolate potential unique users.
When multiple users with the same WONID were simultaneously connected we
assume that these represent different players. Using these heuristics, we estimate
that the 14776 WONIDs actually represent 16969 users.
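
The heuristics above can be illustrated with a short sketch. This is not the procedure actually used in the study, only one plausible reading of it: sessions under the same WONID are attributed to the same user when they share at least one nickname and were not connected simultaneously. The Session record, its field names, and the greedy grouping order are all assumptions made for illustration.

```python
from collections import defaultdict
from typing import NamedTuple


class Session(NamedTuple):
    """Hypothetical log record; the field names are illustrative only."""
    wonid: int
    nicknames: frozenset  # every nickname used during the session
    start: float          # connection time, in seconds
    end: float            # disconnection time, in seconds


def overlaps(a: Session, b: Session) -> bool:
    """True if the two sessions were connected at the same time."""
    return a.start < b.end and b.start < a.end


def estimate_users(sessions):
    """Group sessions into estimated users (each user is a list of sessions)."""
    by_wonid = defaultdict(list)
    for s in sessions:
        by_wonid[s.wonid].append(s)

    users = []
    for group in by_wonid.values():
        candidates = []  # users found so far for this WONID
        for s in sorted(group, key=lambda x: x.start):
            for user in candidates:
                shares_name = any(s.nicknames & t.nicknames for t in user)
                simultaneous = any(overlaps(s, t) for t in user)
                # Same user only if a nickname is shared and the sessions
                # never overlap in time (assumed reading of the heuristics).
                if shares_name and not simultaneous:
                    user.append(s)
                    break
            else:
                candidates.append([s])
        users.extend(candidates)
    return users
```

A greedy grouping such as this depends on the order in which sessions are examined, so it gives only an estimate, in the same spirit as the figure of 16969 users quoted above.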

3.2 Measuring Delay 

To measure the delay observed by each player we used the game server’s built-in
facilities. The server was set up to log the application-level round-trip delay
every 30 seconds. Of the total 1314007 measurements, we removed the 16306
with a value of 0, assuming that they are errors. We also saw 10592
measurements greater than 1000ms, with a maximum of 119043ms. We removed these
measurements, also assuming that they are errors, since it is unlikely that any
user would be able to play a networked game effectively with a 2 minute delay.
Moreover, a similar FPS game, Quake III Arena, also assumes that client delays
over 1000ms are errors, and chooses not to report them to users, and so we do
the same here.
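
The filtering rule described above is simple enough to state as code. The following is a minimal sketch, assuming the delay samples are available as a plain list of millisecond values; the function and constant names are illustrative and do not come from the game server or its logs.

```python
MAX_VALID_DELAY_MS = 1000  # samples above this are treated as errors


def filter_delays(samples_ms):
    """Drop samples equal to 0 or greater than 1000 ms, keeping the rest."""
    return [d for d in samples_ms if 0 < d <= MAX_VALID_DELAY_MS]


# Number of measurements left after both removals reported in the text.
remaining = 1314007 - 16306 - 10592
```

With the counts given in the text, 1287109 measurements remain for analysis.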

We did not measure network-level delay, e.g. through ICMP pings, since 
one of our experimental design criteria was that we did not want to alter the 
server in any way, or send additional traffic to players, in case this altered player 
behaviour or deterred some potential players from connecting to the server. 
With over 15,000 other potential servers to choose from, we did not wish to alter 
conditions in such a way that players might be driven elsewhere. In any case, 
it is the application-level delay which the users themselves observe and thus 
one would expect that this would have a greater effect on their behaviour than 
network-level delay. Informal testing showed us that on a lightly-loaded client, 
the delays reported by Half-Life are within 5ms of a network-level ping, but, 
unsurprisingly, this difference rises with client and server load. Unfortunately, 
without access to the source code of the game, we cannot be sure what causes 
the erroneous (> 1000ms) delays. 

Although Half-Life does include features for refusing players admission
depending on their delay, and for compensating for variations in player delays [3],
these were disabled on our test server, since they might influence any results
concerning relative delays.

In addition to measuring the application-level delay, we also performed a 
whois lookup on players’ IP addresses in order to obtain some indication of their 
area of origin. Further details of the methodology and results of this analysis,
and anonymised versions of the server logs, can be found at the author’s webpage
(http://www.cs.ucl.ac.uk/staff/T.Henderson).

4 Results 

Our main results can be summarised as follows: 

— There is a wide distribution in players’ average delay, with over 40% of 
players experiencing delays of over 225ms. 

— Delay does not appear to play a part in a player’s decision to return to the 
server, or to stay on the server. 

— Most players connect during 1800-2400, according to their respective time- 
zones. 

— There is some correlation between a player’s ability and their delay and 
session duration. 

— Social bonds do not appear to have an effect on player behaviour. 

4.1 Absolute Delay 



Fig. 1. Distribution of Players’ Average Delay.



Figure 1 shows the distribution of the average delay observed over the
duration of each player’s session. The largest proportion of players appear to have
delays of between 50 and 300ms, while 95% of players have delays under 533ms
(Table 2). The large number of users with high delays of over 225ms is
interesting since gameplay should theoretically be quite difficult at this level. However,
Figure 1 also includes the delays of “tourists”: those players who connect to a
server, examine the status of the game and then choose to leave. Figure 2(a)
shows the distribution of delay for all the players compared to those who stay
less than a minute, and those who stay more than 10 minutes and 1 hour. It can
be seen that the delay of those players who stay less than a minute is generally
higher than that of those who stay for longer. A player with a delay of over 400ms is
2.68 times as likely to stay for less than one minute. This implies that delay is
a determinant in a player’s decision to join a server; players with high delays to
a particular server will look elsewhere.

If, however, 100-225ms represents the upper delay bounds for interaction,
then we would expect that most of the players with delay above this level would
choose other servers. Yet 40.56% of the players who stay for more than one
minute have average delays of over 225ms, and there is no significant difference
in the session duration of players with delays over 225ms.



Fig. 2. Kernel Density Functions of Delay Distribution. (a) Distribution of
Players’ Average Delay. (b) Distribution of Repeat Players’ Delay.



We define regular players as those who played more than 10 times and whose
average session duration exceeded one minute. There were 279 such players.
Figure 2(b) indicates that the repeat players’ mean delay tends to be lower, but
this difference is not statistically significant at the 5% level.



Table 2. Overall Delay Results.

Players     Mean delay (ms)    95th percentile (ms)
All                     232                     533
Regular                 176                     424
Tourists                339                     733

4.2 Relative Delay

Absolute delay bounds might not be that important because players become 
accustomed to high delays, or they have no choice because they happen to have 
poor network connectivity. A more important delay metric might be the relative 
delay between players. If one player has a much lower delay than the other 
players in the game, they might be able to exploit this advantage, by attacking 
players before they are able to respond. 

We measure the relative delay in two ways. First, we look at the nominal 
results and calculate the standard deviation of players’ delay to give an estimate 
of range. Secondly, we analyse the ordinal data and look at a player’s rank in 
terms of delay compared to the other players. 

For most of the time, there is a deviation of around 100ms between players
(Figure 3). This seems reasonable, given that most players’ delay is in the region
of 100-200ms.



Fig. 3. Distribution of Standard Deviation of Delay (players with duration >= 1 min).



The “delay rank” of a particular player was calculated by ordering the players
at each delay measurement period by delay to produce a rank r, and then scaling
by the number of players n against the potential maximum number of players of 24,
i.e. r × 24/n. Thus, the player with the highest delay would always receive a delay
rank of 24, whereas the minimum possible score of 1 would only be possible if
the player had the lowest delay and there were 24 players on the server. Figure 4
indicates some correlation between delay rank and session duration; players who
leave tend to have a higher rank.
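
The two relative-delay measures used in this section can be sketched as follows, for the players present at a single measurement period. This is an illustration rather than the analysis code used in the study: the dictionary input, the function names, the tie-breaking by sort order, and the choice of the population standard deviation are all assumptions.

```python
from statistics import pstdev

MAX_PLAYERS = 24  # server limit stated in Section 3


def delay_ranks(delays_ms):
    """Map each player to the scaled delay rank r * 24 / n.

    delays_ms maps a player identifier to that player's delay in one
    measurement period. Rank 1 is the lowest delay; ties fall back on
    sort order, which is an assumption.
    """
    ordered = sorted(delays_ms, key=delays_ms.get)
    n = len(ordered)
    return {p: r * MAX_PLAYERS / n for r, p in enumerate(ordered, start=1)}


def delay_spread(delays_ms):
    """Standard deviation of the players' delays in one measurement period."""
    return pstdev(list(delays_ms.values()))
```

For example, with three players at 50ms, 120ms and 200ms the scaled ranks are 8, 16 and 24, so the highest-delay player receives 24 regardless of how many players are present.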



Fig. 4. Relative Delay Ranks. (a) Delay Ranks where Player Duration < 1 min.
(b) Delay Ranks where Player Duration > 1 hr.


4.3 Leaving a Game

If delay is a determinant of user behaviour, then one might expect to see a change
in delay towards the end of a user’s session. A sudden increase in delay might
lead a user to leave the server and connect to another one, or give up playing
altogether. We see little evidence for this hypothesis, however. Figure 5(a) shows
the “exit delay” (the delay over the last 5% of a player’s session) compared to
the average delay over the length of the session. This ratio congregates around
the value of 1; i.e., the exit delay is usually comparable to the average delay.
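
The exit-delay ratio can be computed per session along the following lines. This is a minimal sketch under stated assumptions: each session is represented as a time-ordered list of (time in seconds, delay in ms) samples with erroneous values already removed, and the “last 5%” is taken as the final 5% of the session’s elapsed time. The function name and input layout are hypothetical.

```python
def exit_delay_ratio(samples, exit_fraction=0.05):
    """Ratio of the mean delay over the end of a session to the session mean.

    samples is a list of (time_s, delay_ms) pairs sorted by time.
    """
    if not samples:
        raise ValueError("no delay samples")
    start, end = samples[0][0], samples[-1][0]
    cutoff = end - exit_fraction * (end - start)
    # The final sample always satisfies t >= cutoff, so tail is never empty.
    tail = [d for t, d in samples if t >= cutoff]
    average = sum(d for _, d in samples) / len(samples)
    exit_delay = sum(tail) / len(tail)
    return exit_delay / average
```

A ratio close to 1 corresponds to the behaviour reported above, where the delay at the end of a session is comparable to the session average.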



Fig. 5. Exit and Tourists’ Delay. (a) Players’ Exit Delay Compared to Average
Delay. (b) Tourist Delay Compared to Other Players.
Relative delay does not seem to be a determinant of tourists leaving the
server, either. Figure 5(b) shows the ratio of the delay of those tourists who join
and leave the server, compared to the delay of the players already on the server.
The mean is 1.618, and there is no correlation between the two.



4.4 When Do Players Play? 



Fig. 6. Number of Players over Time. (a) Average Number of Players Each Day.
(b) Average Number of North American and European Players Each Day with
Duration > 10 min.



Figure 6(a) shows the average number of players for each day of the week,
sampled every thirty minutes. There is a strong time-of-day component, which
agrees with our previously-observed results [7]. This is perhaps surprising given
that players come from areas with different time zones; Figure 6(b) shows the
number of players from Europe and North America (determined from the whois 
database). The offset in the respective peak times is probably due to time zones. 
Unsurprisingly, the peak usage times are in the late evening, from around 1800 
to 2400. If users tend to play in their spare time, then perhaps they have already 
allocated this time for gameplay, and so are willing to put up with whatever 
network conditions they happen to experience. 



4.5 Player Skill 

A player’s ability might have an effect on the delay that they can tolerate. A user
who is highly skilled at playing the game might be able to cope with higher delays
than beginners, since they might be able to predict other players’ behaviour and
thus compensate for higher-than-average lag. In human factors terms, a high level
of skill might lead to players being able to perform actions without conscious
awareness; in other words, playing the game becomes automatic.

The session-level logs include details of which players killed each other, and
with which particular weapon. Using this information it is possible to estimate
the skill of each player. Whenever a kill takes place, we update the players’
skill using the following formula:

$S_k = S_k + \frac{S_k}{S_d}\, w; \qquad S_d = S_d - \frac{S_k}{S_d}\, w$

where $S_k$ = killer’s skill, $S_d$ = killed player’s skill, and $w$ is an adjustment for
the weapon used (e.g., it is harder to kill with a crowbar than a machine gun).
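
As an illustration, the update rule can be written as a small function, reading the formula as: the killer gains (killer’s skill / killed player’s skill) × w and the killed player loses the same amount. The starting skill of 1000 and any particular weapon weights are assumptions made for this sketch; the text does not state the values that were used.

```python
def update_skill(skills, killer, victim, weapon_weight, initial=1000.0):
    """Apply one kill to a dict of player skills and return the dict.

    initial is a hypothetical starting skill for players not yet seen;
    weapon_weight plays the role of the weapon adjustment w.
    """
    s_k = skills.get(killer, initial)
    s_d = skills.get(victim, initial)
    delta = s_k / s_d * weapon_weight
    skills[killer] = s_k + delta
    skills[victim] = s_d - delta
    return skills
```

Under this reading each kill transfers skill from one player to the other, so the total skill over all players stays constant.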

Using this metric, we see only weak correlations between skill and delay, and
between skill and the duration of a player’s game. We see some positive correlation
(R = 0.378) between session duration and skill, so the more expert players do tend
to stay longer. There is a slight negative correlation (R = −0.231) between skill and
delay, so a lower delay may lead to improved performance.
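
The text does not say which correlation coefficient R denotes. Assuming it is the usual Pearson product-moment coefficient, it could be computed from per-player values as in the sketch below; the function name and the list inputs are illustrative.

```python
from math import sqrt


def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length samples."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two equal-length samples of size >= 2")
    mx = sum(xs) / len(xs)
    my = sum(ys) / len(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var_x = sum((x - mx) ** 2 for x in xs)
    var_y = sum((y - my) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)
```

Here xs might hold each player’s skill and ys the matching average session duration or average delay. Values such as 0.378 and −0.231 lie well inside the range from −1 to 1, which is why the relationships are described as weak.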




Fig. 7. Effect of Skill. (a) Skill versus Duration. (b) Skill versus Delay.



4.6 Social Bonds 

Figure 2(b) indicates the presence of certain players who had excessively high
delays, yet kept returning to the server. Here we analyse some of these players
in more detail to see if it is possible to determine why they keep coming back.
Table 3 shows some of these players’ statistics.

Table 3. Detailed Statistics for Regular Players with High Delays.

TLD (from   Number of   Average      Maximum      Average          Maximum
whois)      visits      delay (ms)   delay (ms)   duration (sec)   duration (sec)
TW                 10          423          533              390             1460
TH                 38          794          953             2319            10418
KR                 14          340          389             2477            10043
TW                 15          554          771              665             1659
KR                 13          339          362              970             2553

One possibility is that these players are returning to the server because of
friends who are also playing. However, of the 283 other players who were on the
server at the same time as these five players, only 33 players appear twice and
one appears three times. It is therefore unlikely that repeat visits or social bonds
were the reason for these players returning.



5 Conclusions, Caveats, and Future Work 

This study has looked at the effects of delay on user dynamics on a multiplayer 
games server. We find that application-level delay does not appear to have a
significant effect on a user’s behaviour once they have chosen to connect to a 
games server. Although the majority of users have delays within the bounds 
predicted by previous VR and DIS studies, changes in this delay do not seem to 
lead to players aborting the game. 

Although brief and still at an interim stage, this study has raised a number 
of interesting research questions. If users are concerned with relative delay, is 
it possible to design efficient algorithms for determining the server with the 
smallest standard deviation in delay given a group of prospective users? There 
is currently a lot of interest in optimal placement of web and mirror servers,
e.g. [12], and perhaps this could be extended to locating game servers.

That games players might be unaffected by sudden changes in delay has 
important implications for designing potential congestion control schemes for 
games. In particular, pricing schemes that depend on users adapting to network 
conditions because of price changes might be less practical. If users tend to 
remain in a game for an exogenously determined duration, then session-based 
pricing or reservations might make more sense from a user’s perspective. Pricing 
schemes for games could be designed to adapt the network to the user (who has 
already committed to playing a game), rather than the other way around. 

Usability studies of multiplayer networked FPS games, e.g. a GOMS (Goals,
Operators, Methods and Selection rules) analysis such as that performed in [9], might
help to explain some of the results we have seen, since simple correlations of kills 
and deaths appear to be insufficient. More elaborate skill metrics, e.g. taking into 
account the amount of time between deaths, might also prove fruitful. 

As our results are only from the study of a single Half-Life server, which 
may not be representative of servers across the Internet as a whole, we intend 
to investigate lightweight methods for instrumenting larger numbers of servers 
for future data collection. In spite of these limitations, this study has provided
us with some direction for future experimental work. This study has only been 
correlational — we have examined and attempted to interpret results from an 
unmodified server. In future work we intend to run multiple servers, modifying 
variables such as network delay and jitter, and simulating different congestion 
control and QoS policies, to further investigate their effects on user behaviour. 
We expect to find our study corroborated by this further study. 

Acknowledgement 

Thanks to Saleem Bhatti, Jon Crowcroft, and the reviewers for their comments. 

References 

1. R. W. Bailey. Human Performance Engineering: Using Human
Factors/Ergonomics to Achieve Computer System Usability. Prentice Hall, Englewood
Cliffs, NJ, second edition, 1989.

2. R. A. Bangun, E. Dutkiewicz, and G. J. Anido. An analysis of multi-player network 
games traffic. In Proceedings of the 1999 International Workshop on Multimedia 
Signal Processing, pages 3-8, Copenhagen, Denmark, Sept. 1999. 

3. Y. W. Bernier. Latency compensating methods in client/server in-game protocol
design and optimization. In Proceedings of the 15th Games Developers Conference,
San Jose, CA, Mar. 2001.

4. M. S. Borella. Source models of network game traffic. Computer Communications, 
23(4):403-410, Feb. 15, 2000. 

5. S. Cheshire. Latency and the quest for interactivity, Nov. 1996. White paper 
commissioned by Volpe Welty Asset Management, L.L.C., for the Synchronous 
Person-to-Person Interactive Computing Environments Meeting. 

6. L. Gautier and C. Diot. Design and evaluation of MiMaze, a multi-player game 
on the Internet. In Proceedings of the 1998 IEEE International Conference on 
Multimedia Computing and Systems, pages 233-236, Austin, TX, June 1998. 

7. T. Henderson and S. Bhatti. Modelling user behaviour in networked games. In 
Proceedings of the 9th ACM Multimedia Conference, Ottawa, Canada, Oct. 2001. 

8. Institute of Electrical and Electronics Engineers. 1278.2-1995, IEEE Standard for
Distributed Interactive Simulation — Communication Services and Profiles. IEEE, 
New York, NY, Apr. 1996. 

9. B. E. John and A. H. Vera. A GOMS analysis of a graphic, machine-paced, highly 
interactive task. In Proceedings of the CHI’92 Conference on Human factors in 
computing systems, pages 251-258, Monterey, CA, May 1992. 

10. I. S. MacKenzie and C. Ware. Lag as a determinant of human performance in
interactive systems. In Proceedings of the CHI ’93 Conference on Human factors
in computing systems, pages 488-493, Amsterdam, The Netherlands, Apr. 1993.

11. S. McCreary and K. Claffy. Trends in wide area IP traffic patterns: A view from
Ames Internet Exchange. In Proceedings of the ITC Specialist Seminar on IP
Traffic Modeling, Measurement and Management, Monterey, CA, Sept. 2000.

12. L. Qiu, V. N. Padmanabhan, and G. M. Voelker. On the placement of web server 
replicas. In Proceedings of the 20th IEEE Conference on Computer Communica- 
tions (INFOCOM), pages 1587-1596, Anchorage, AK, Apr. 2001. 

13. Valve Software. Half-Life, http://www.sierrastudios.com/games/half-life/. 




Application-Level Multicast Using 
Content-Addressable Networks 

Sylvia Ratnasamy^1,2, Mark Handley^2, Richard Karp^1,2, and Scott Shenker^2 

^1 University of California, Berkeley, CA, USA 
^2 AT&T Center for Internet Research at ICSI 



Abstract. Most currently proposed solutions to application-level mul- 
ticast organise the group members into an application-level mesh over 
which a Distance-Vector routing protocol, or a similar algorithm, is used 
to construct source-rooted distribution trees. The use of a global routing 
protocol limits the scalability of these systems. Other proposed solutions 
that scale to larger numbers of receivers do so by restricting the mul- 
ticast service model to be single-sourced. In this paper, we propose an 
application-level multicast scheme capable of scaling to large group sizes 
without restricting the service model to a single source. Our scheme 
builds on recent work on Content-Addressable Networks (CANs). Ex- 
tending the CAN framework to support multicast comes at trivial addi- 
tional cost and, because of the structured nature of CAN topologies, ob- 
viates the need for a multicast routing algorithm. Given the deployment 
of a distributed infrastructure such as a CAN, we believe our CAN-based 
multicast scheme offers the dual advantages of simplicity and scalability. 



1 Introduction 

Several recent research projects [?] propose designs for application-level
networks wherein nodes are structured in some well-defined manner. A
Content-Addressable Network (CAN) [?] is one such system. Briefly,^1 a
Content-Addressable Network is an application-level network whose constituent
nodes can be thought of as forming a virtual d-dimensional Cartesian coordinate
space. Every node in a CAN “owns” a portion of the total space. For example,
Figure 1 shows a 2-dimensional CAN occupied by 5 nodes. A CAN, as described in
[?], is scalable, fault-tolerant and completely distributed. Such CANs are
useful for a range of distributed applications and services. For example, in
[?] we focus on the use of a CAN to provide hash table-like functionality on
Internet-like scales, a function useful for indexing in peer-to-peer
applications, large-scale storage management systems, the construction of
wide-area name resolution services and so forth.

^1 Section 2 describes the CAN design in some detail.

This paper looks into the question of how the deployment of such CAN-like
distributed infrastructures might be utilised to support multicast services and
applications. We outline the design of an application-level multicast scheme
built using a CAN. Our design shows that extending the CAN framework to support
multicast comes at trivial additional cost in terms of complexity and added
protocol mechanism. A key feature of our scheme is that because we exploit
the well-defined structured nature of CAN topologies (i.e. the virtual
coordinate space) we can eliminate the need for a multicast routing algorithm
to construct distribution trees. This allows our CAN-based multicast scheme to
scale to large group sizes. While our design is in the context of CANs in
particular, we believe our technique of exploiting the structure of these
systems should be applicable to the Chord [?], Pastry and Tapestry [?] designs.

In previous work, several research proposals have argued for application-level
multicast as a more tractable alternative to a network-level multicast service
and have described designs for such a service and its applications. The
majority of these proposed solutions (for example [?]) typically involve
having the members of a multicast group self-organise into an essentially
random application-level mesh topology over which a traditional multicast
routing algorithm such as DVMRP [2] is used to construct distribution trees
rooted at each possible traffic source. Such routing algorithms require every
node to maintain state for every other node in the topology. Hence, although
these proposed solutions are well suited to their targeted applications,^2
their use of a global routing algorithm limits their ability to scale to large
(many thousands of nodes) group sizes and to operate under conditions of
dynamic group membership.

Bayeux [?] is an application-level multicast scheme that scales to large group 
sizes but restricts the service model to a single source. In contrast to the above 
schemes, CAN-based multicast can scale to large group sizes without restricting 
the service model to a single source. 

In summary, we believe our CAN-based multicast scheme offers two key ad- 
vantages: 

- CAN-based multicast can scale to very large (i.e. many thousands of nodes
and higher) group sizes without restricting the service model to a single
source. To the best of our knowledge, no currently proposed application-level
multicast scheme can operate in this regime.

- Assuming the deployment of a CAN-like infrastructure, CAN-based multicast
is trivially simple to achieve. This is not to suggest that CAN-based
multicast by itself is either simpler or more complex than other proposed
solutions to application-level multicast. Rather, our point is that CANs can
serve as a building block in a range of Internet applications and services and
that one such, easily achievable, service is application-level multicast.

The remainder of this paper is organised as follows: Section 2 reviews the
design and operation of a CAN. We describe the design of a CAN-based multicast
service in Section 3 and evaluate this design through simulation in Section 4.
Finally, we discuss related work in Section 5 and conclude.

^2 The authors in [?] state that End System Multicast is more appropriate for
small, sparse groups as in audio-video conferencing and virtual classrooms,
while the authors in [?] apply their algorithm, Gossamer, to the
self-organisation of infrastructure proxies.

2 Content-Addressable Networks

Fig. 1. [...] Nodes.

Fig. 2. Example 2-d Space before Node 7 Joins.

Fig. 3. Example 2-d Space after Node 7 Joins. (Labels in the figure: 1's
coordinate neighbor set = {2,3,4,7}; 7's coordinate neighbor set = {1,2,4,5}.)



In this Section, we present our design of a Content-Addressable Network. This 
paper gives only a brief overview of our CAN design; [?] presents the details and 
evaluation. 



2.1 Design Overview 

Our design centers around a virtual d-dimensional Cartesian coordinate space on 
a d-torus.^3 This coordinate space is completely logical and bears no relation to 
any physical coordinate system. At any point in time, the entire coordinate space 
is dynamically partitioned among all the nodes in the system such that every 
node “owns” its individual, distinct zone within the overall space. For example, 
Figure 1 shows a 2-dimensional [0, 1] x [0, 1] coordinate space partitioned between 
5 CAN nodes. This coordinate space provides us with a level of indirection, since 
one can now talk about storing content at a “point” in the space or routing 
between “points” in the space where a “point” refers to the node in the CAN 
that owns the zone enclosing that point. 

^3 For simplicity, the illustrations in this paper do not show a torus.

For example, this virtual coordinate space is used to store (key, value) pairs
as follows: to store a pair (K1, V1), key K1 is deterministically mapped onto a
point, say (x, y), in the coordinate space using a uniform hash function. The
corresponding key-value pair is then stored at the node that owns the zone
within which the point (x, y) lies. To retrieve an entry corresponding to key
K1, any node can apply the same deterministic hash function to map K1 onto
point (x, y) and then retrieve the corresponding value from the point (x, y).
If the point (x, y) is not owned by the requesting node or its immediate
neighbours, the request must be routed through the CAN infrastructure until it
reaches the node in whose zone (x, y) lies. Efficient routing is therefore a
critical aspect of our CAN.
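The mapping from a key to a point, and from a point to its owning node, can be sketched as follows. This is only an illustration under stated assumptions: the text asks for a uniform deterministic hash but does not name one, so SHA-256 is an arbitrary choice here, and the names `key_to_point` and `owner_of` and the dictionary-of-zones layout are hypothetical. The torus wrap-around is ignored.

```python
import hashlib

def key_to_point(key: str, d: int = 2) -> tuple:
    """Deterministically map a key onto a point in [0, 1)^d.

    Assumption: SHA-256 stands in for the unspecified uniform hash function;
    4 bytes of the digest are used per coordinate, so d must be at most 8.
    """
    digest = hashlib.sha256(key.encode("utf-8")).digest()
    coords = []
    for i in range(d):
        chunk = digest[4 * i: 4 * i + 4]
        coords.append(int.from_bytes(chunk, "big") / 2**32)
    return tuple(coords)

def owner_of(point, zones):
    """Return the id of the node whose zone encloses `point`.

    `zones` (hypothetical layout) maps a node id to a list of half-open
    (lo, hi) intervals, one per dimension.
    """
    for node, zone in zones.items():
        if all(lo <= x < hi for x, (lo, hi) in zip(point, zone)):
            return node
    raise ValueError("zones do not cover the point")

# Example: a 2-d space split into four equal zones.
zones = {
    "A": [(0.0, 0.5), (0.0, 0.5)],
    "B": [(0.5, 1.0), (0.0, 0.5)],
    "C": [(0.0, 0.5), (0.5, 1.0)],
    "D": [(0.5, 1.0), (0.5, 1.0)],
}
point = key_to_point("K1")
store_at = owner_of(point, zones)  # node responsible for the pair (K1, V1)
```

Any node that applies the same function to K1 obtains the same point, which is what allows retrieval without a central index.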

Nodes in the CAN self-organise into an overlay network that represents this 
virtual coordinate space. A node learns and maintains as its set of neighbours 
the IP addresses of those nodes that hold coordinate zones adjoining its own 
zone. This set of immediate neighbours serves as a coordinate routing table that 
enables routing between arbitrary points in the coordinate space. 

We first describe the three most basic pieces of our design: CAN routing,
construction of the CAN coordinate overlay, and maintenance of the CAN overlay.
We then briefly discuss the simulated performance of our design.

2.2 Routing in a CAN 

Intuitively, routing in a Content-Addressable Network works by following the
straight-line path through the Cartesian space from source to destination
coordinates.

A CAN node maintains a coordinate routing table that holds the IP address
and virtual coordinate zone of each of its neighbours in the coordinate space.
In a d-dimensional coordinate space, two nodes are neighbours if their
coordinate spans overlap along d-1 dimensions and abut along one dimension. For
example, in Figure 2, node 5 is a neighbour of node 1 because its coordinate
zone overlaps with 1's along the Y axis and abuts along the X axis. On the
other hand, node 6 is not a neighbour of 1 because their coordinate zones abut
along both the X and Y axes. This purely local neighbour state is sufficient to
route between two arbitrary points in the space: a CAN message includes the
destination coordinates. Using its neighbour coordinate set, a node routes a
message towards its destination by simple greedy forwarding to the neighbour
with coordinates closest to the destination coordinates. Figure 2 shows a
sample routing path.
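The neighbour test and the greedy forwarding step can be sketched as below. This is a minimal illustration, not the authors' implementation: zones are assumed to be tuples of (lo, hi) intervals, the torus wrap-around is ignored, and "closest" is taken to mean the Euclidean distance from the destination point to a neighbour's zone, which is one plausible reading of the text. All function names are hypothetical.

```python
def overlaps(a, b):
    """True if intervals a and b share more than a single boundary point."""
    return max(a[0], b[0]) < min(a[1], b[1])

def abuts(a, b):
    """True if intervals a and b touch end to end."""
    return a[1] == b[0] or b[1] == a[0]

def are_neighbours(z1, z2):
    """Zones are neighbours if they overlap along d-1 dimensions and abut along one."""
    abut_dims = sum(abuts(a, b) for a, b in zip(z1, z2))
    overlap_dims = sum(overlaps(a, b) for a, b in zip(z1, z2))
    return abut_dims == 1 and overlap_dims == len(z1) - 1

def dist_to_zone(point, zone):
    """Euclidean distance from a point to the nearest point of a zone (0 if inside)."""
    return sum(max(lo - x, 0.0, x - hi) ** 2
               for x, (lo, hi) in zip(point, zone)) ** 0.5

def next_hop(neighbours, dest):
    """Greedy forwarding: pick the neighbour whose zone is closest to dest.

    `neighbours` maps neighbour id -> zone (the coordinate routing table).
    """
    return min(neighbours, key=lambda n: dist_to_zone(dest, neighbours[n]))

# Four equal zones in a 2-d space (hypothetical layout).
A = ((0.0, 0.5), (0.0, 0.5))
B = ((0.5, 1.0), (0.0, 0.5))
C = ((0.0, 0.5), (0.5, 1.0))
D = ((0.5, 1.0), (0.5, 1.0))
```

Here `are_neighbours(A, D)` is false because the two zones abut along both axes, which mirrors the node 1 / node 6 example in the text.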

For a d-dimensional space partitioned into n equal zones, the average routing
path length is thus $(d/4)(n^{1/d})$ and individual nodes maintain 2d
neighbours. These scaling results mean that for a d-dimensional space, we can
grow the number of nodes (and hence zones) without increasing per-node state
while the path length grows as $O(n^{1/d})$.
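As a quick numerical illustration of these scaling results: with n equal zones there are n^(1/d) zones along each dimension, and on a ring of k zones the average distance to a uniformly chosen zone is about k/4, which gives (d/4)n^(1/d) over d dimensions. The helper names below are hypothetical.

```python
def avg_path_length(n: int, d: int) -> float:
    """Average routing path length (in overlay hops) for n equal zones in d dimensions."""
    return (d / 4) * n ** (1 / d)

def neighbours_per_node(d: int) -> int:
    """Per-node routing state for a perfectly partitioned d-dimensional CAN."""
    return 2 * d

# Per-node state depends only on d, while the path length shrinks as d grows.
for d in (2, 3, 6):
    print(d, neighbours_per_node(d), round(avg_path_length(4096, d), 2))
```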

Note that many different paths exist between two points in the space and 
so, even if one or more of a node’s neighbours were to crash, a node would 
automatically route along the next best available path. If, however, a node loses
all its neighbours in a certain direction, and the repair mechanisms described in
Section 2.4 have not yet rebuilt the void in the coordinate space, then greedy
forwarding may temporarily fail. In this case, a node may use an expanding ring 
search to locate a node that is closer to the destination than itself. The message 
is then forwarded to this closer node, from which greedy forwarding is resumed. 

2.3 CAN Construction 

As described above, the entire CAN space is divided amongst the nodes currently 
in the system. To allow the CAN to grow incrementally, a new node that joins
the system must be allocated its own portion of the coordinate space. This is
done by an existing node splitting its allocated zone in half, retaining half and 
handing the other half to the new node. 

The process takes three steps: 

1. First the new node must find a node already in the CAN. 

2. Next, using the CAN routing mechanisms, it must find a node whose zone 
will be split. 

3. Finally, the neighbours of the split zone must be notified so that routing can 
include the new node. 



Bootstrap. A new CAN node first discovers the IP address of any node currently
in the system. The functioning of a CAN does not depend on the details
of how this is done, but we use the same bootstrap mechanism as Yallcast and
YOID [?]. As in [?], we assume that a CAN has an associated DNS domain name,
and that this resolves to the IP address of one or more CAN bootstrap nodes. A
bootstrap node maintains a partial list of CAN nodes it believes are currently in
the system. Simple techniques to keep this list reasonably current are described
in [?]. To join a CAN, a new node looks up the CAN domain name in DNS to
retrieve a bootstrap node’s IP address. The bootstrap node then supplies the IP
addresses of several randomly chosen nodes currently in the system.



Finding a Zone. The new node then randomly chooses a point (x, y) in the 
space and sends a JOIN request destined for point (x,y). This message is sent 
into the CAN via any existing CAN node. Each CAN node then uses the CAN 
routing mechanism to forward the message, until it reaches the node in whose 
zone (x,y) lies. 

This current occupant node then splits its zone in half and assigns one half to 
the new node. The split is done by assuming a certain ordering of the dimensions 
in deciding along which dimension a zone is to be split, so that zones can be 
re-merged when nodes leave. For a 2-d space a zone would first be split along 
the X dimension, then the Y and so on. The (key, value) pairs from the half zone 
to be handed over are also transferred to the new node. 
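The zone split can be sketched as follows. This is an illustration under assumptions the text leaves open: the split dimension is derived from a per-zone split counter (`depth`), the existing occupant keeps the lower half, and stored pairs are indexed by their hashed point. The `Zone` class and the function names are hypothetical.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Zone:
    bounds: tuple   # ((lo, hi), ...), one half-open interval per dimension
    depth: int = 0  # number of splits that produced this zone

def split(zone: Zone):
    """Halve a zone along the next dimension in the fixed ordering (X, then Y, ...)."""
    dim = zone.depth % len(zone.bounds)
    lo, hi = zone.bounds[dim]
    mid = (lo + hi) / 2
    lower, upper = list(zone.bounds), list(zone.bounds)
    lower[dim] = (lo, mid)
    upper[dim] = (mid, hi)
    return (Zone(tuple(lower), zone.depth + 1),
            Zone(tuple(upper), zone.depth + 1))

def contains(zone: Zone, point) -> bool:
    return all(lo <= x < hi for x, (lo, hi) in zip(point, zone.bounds))

def join(occupant_zone: Zone, store: dict):
    """Split the occupant's zone and hand over the pairs that fall in the new half.

    `store` maps a hashed point to its (key, value) pair.
    Assumption: the occupant keeps the lower half and gives away the upper half.
    """
    kept, given = split(occupant_zone)
    moved = {p: kv for p, kv in store.items() if contains(given, p)}
    remaining = {p: kv for p, kv in store.items() if p not in moved}
    return kept, given, remaining, moved
```

Because the split dimension follows a fixed order, two sibling zones always differ in exactly one interval, which is what makes re-merging possible when a node leaves.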



Joining the Routing. Having obtained its zone, the new node learns the
IP addresses of its coordinate neighbour set from the previous occupant. This
set is a subset of the previous occupant’s neighbours, plus that occupant
itself. Similarly, the previous occupant updates its neighbour set to eliminate
those nodes that are no longer neighbours. Finally, both the new and old nodes’
neighbours must be informed of this reallocation of space. Every node in the
system sends an immediate update message, followed by periodic refreshes, with
its currently assigned zone to all its neighbours. These soft-state style updates
ensure that all of their neighbours will quickly learn about the change and will
update their own neighbour sets accordingly. Figures 2 and 3 show an example
of a new node (node 7) joining a 2-dimensional CAN.

As can be inferred, the addition of a new node affects only a small number
of existing nodes in a very small locality of the coordinate space. The number
of neighbours a node maintains depends only on the dimensionality of the
coordinate space and is independent of the total number of nodes in the system.
Thus, node insertion affects only O(number of dimensions) existing nodes, which
is important for CANs with huge numbers of nodes.

2.4 Node Departure, Recovery, and CAN Maintenance 

When nodes leave a CAN, we need to ensure that the zones they occupied are 
taken over by the remaining nodes. The normal procedure for doing this is for a 
node to explicitly hand over its zone and the associated (key, value) database to 
one of its neighbours. If the zone of one of the neighbours can be merged with 
the departing node’s zone to produce a valid single zone, then this is done. If 
not, then the zone is handed to the neighbour whose current zone is smallest, 
and that node will then temporarily handle both zones. 
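The hand-over decision can be sketched as below. This is a simplified illustration: "valid single zone" is approximated here by the union being a box (the two zones agree in every dimension but one and abut in that one), whereas the actual design may impose further conditions tied to the split ordering. Zones are tuples of (lo, hi) intervals and all names are hypothetical.

```python
def volume(zone):
    """Size of a zone, used to find the neighbour with the smallest zone."""
    v = 1.0
    for lo, hi in zone:
        v *= hi - lo
    return v

def try_merge(z1, z2):
    """Return the merged zone if z1 and z2 form a single box, else None."""
    diff = [i for i, (a, b) in enumerate(zip(z1, z2)) if a != b]
    if len(diff) != 1:
        return None
    i = diff[0]
    (lo1, hi1), (lo2, hi2) = z1[i], z2[i]
    if hi1 == lo2:
        merged = (lo1, hi2)
    elif hi2 == lo1:
        merged = (lo2, hi1)
    else:
        return None
    return z1[:i] + (merged,) + z1[i + 1:]

def choose_takeover(departing, neighbours):
    """Pick the neighbour that takes over `departing` and the zones it then holds.

    `neighbours` maps neighbour id -> zone. A neighbour that can merge is
    preferred; otherwise the neighbour with the smallest zone temporarily
    handles both zones.
    """
    for node, zone in neighbours.items():
        merged = try_merge(departing, zone)
        if merged is not None:
            return node, [merged]
    node = min(neighbours, key=lambda n: volume(neighbours[n]))
    return node, [neighbours[node], departing]
```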

The CAN also needs to be robust to node or network failures, where one 
or more nodes simply become unreachable. This is handled through a recovery 
algorithm, described in [?], that ensures one of the failed node’s neighbours 
takes over the zone. 

2.5 Design Improvements and Performance 

Our basic CAN algorithm as described in the previous section provides a balance
between low per-node state ($O(d)$ for a d-dimensional space) and short path
lengths with $O(dn^{1/d})$ hops for d dimensions and n nodes. This bound applies
to the number of hops in the CAN path. These are application-level hops, not
IP-level hops, and the latency of each hop might be substantial; recall that
nodes that are adjacent in the CAN might be many miles (and many IP hops)
away from each other. In [?], we describe a number of design techniques whose
primary goal is to reduce the latency of CAN routing. Of particular relevance
to the work in this paper is a distributed “binning” scheme whereby co-located
nodes on the Internet can be placed close by in the CAN coordinate space. In
this scheme, every node independently measures its distance (i.e. latency) from
a set of well known landmark machines and joins a particular portion of the
coordinate space based on these measurements. Our simulation results in [?]
indicate that these added mechanisms are very effective in reducing overall path
latency. For example, we show that for a system with over 130,000 nodes, for a
range of link delay distributions, we can route with a latency that is well within
a factor of three of the underlying IP network latency. The number of neighbours
that a node must maintain to achieve this is approximately 28 (details of this
test are in Section 4 in [?]).

3 CAN-Based Multicast

In this section, we describe a solution whereby CANs can be used to offer an
application-level multicast service.

If all the nodes in a CAN are members of a given multicast group, then
multicasting a message only requires flooding the message over the entire CAN.
As we shall describe in Section 3.2, we can exploit the existence of a
well-defined coordinate space to provide simple, efficient flooding algorithms
from arbitrary sources without having to compute distribution trees for every
potential source.

If only a subset of the CAN nodes are members of a particular group, then 
multicasting involves two pieces: 

- the members of the group first form a group-specific “mini” CAN and then,

- multicasting is achieved by flooding over this mini CAN.

In what follows, we describe the two key components of our scheme: group
formation and multicast by flooding over the CAN.

3.1 Multicast Group Formation

To assist in our explanation, we assume the existence of a CAN C within which
a subset of the nodes wish to form a multicast group G. We achieve this by
forming an additional mini CAN, call it Cg, made up of only the members of
G. The underlying CAN C itself is used as the bootstrap for the formation
of Cg as follows: using a well-known hash function, the group address G is
deterministically mapped onto a point, say (x, y), and the node on C that owns
the point (x, y) serves as the bootstrap node in the construction of Cg. Joining
group G thus reduces to joining the CAN Cg. This is done by repeating the
usual CAN construction process with the owner of (x, y) as the bootstrap node.
Because of the light-weight nature of the CAN bootstrap mechanisms, we do not
expect the CAN bootstrap node to be overloaded by join requests. If this
becomes a possibility, however, one could use multiple bootstrap nodes to share
the load by using multiple hash functions to deterministically map the group
name G onto multiple points in the CAN C; the nodes corresponding to each of
these points would then serve as a bootstrap node for the group G. As with the
CAN bootstrap process, the failure of the bootstrap node(s) does not affect the
operation of the multicast group itself; it only prevents new nodes from joining
the group during the period of failure.

Thus, every group has a corresponding CAN made up of all the group mem- 
bers. Note that with this group formation process a node only maintains state 
for those groups for which it is itself a member or for which it serves as the 
bootstrap node. For a d-dimensional CAN, a member node maintains state for 
2d additional nodes (its neighbours in the CAN), independent of the number of 
traffic sources in the multicast group. 
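Locating the bootstrap node(s) for a group can be sketched as follows. This is an illustration under assumptions: the "multiple hash functions" are emulated by salting a single hash (SHA-256, an arbitrary choice) with an index, and the function name `group_bootstrap_points` is hypothetical. The node on C that owns each returned point would act as a bootstrap node for Cg.

```python
import hashlib

def group_bootstrap_points(group: str, d: int = 2, k: int = 1):
    """Map a group address onto k points in [0, 1)^d of the underlying CAN C.

    Assumption: hash function number i is SHA-256 applied to "i:group";
    4 digest bytes are used per coordinate, so d must be at most 8.
    """
    points = []
    for i in range(k):
        digest = hashlib.sha256(f"{i}:{group}".encode("utf-8")).digest()
        point = tuple(
            int.from_bytes(digest[4 * j: 4 * j + 4], "big") / 2**32
            for j in range(d)
        )
        points.append(point)
    return points

# A prospective member computes the same points as everyone else and routes a
# request over C to any one of them to reach a bootstrap node for the group.
candidates = group_bootstrap_points("G", d=2, k=3)
```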



3.2 Multicast Forwarding 

Because all the members of group G (and no other node) belong to the
associated CAN Cg, multicasting to G is achieved by flooding on the CAN Cg.
Different flooding algorithms are conceivable; for example, one might consider a
naive flooding algorithm wherein a node caches the sequence numbers of
messages it has recently received. On receiving a new message, a node forwards
the message to all its neighbours (except, of course, the neighbour from which
it received the message) only if that message is not already in its cache. With
this type of flood-cache-suppress algorithm a source can reach every group
member without requiring a routing algorithm to discover the network topology.
Such an algorithm does not make any special use of the CAN structure and could
in fact be run over any application-level topology including a random mesh
topology as generated in [?]. The problem with this type of naive flooding
algorithm is that it can result in a large amount of duplication of messages;
in the worst case, a node could receive a single message from each of its
neighbours.
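The flood-cache-suppress behaviour, and the duplication it causes, can be seen in a small simulation. This is a sketch under simplifying assumptions: a single message, loss-free delivery, and first-in-first-out processing of forwards. The overlay is any adjacency map and the function name is hypothetical.

```python
from collections import deque

def naive_flood(adj, source):
    """Flood one message over an overlay and count copies received per node.

    `adj` maps node -> list of neighbours. A node forwards a message to all
    neighbours except the sender, and only the first time it sees the message.
    """
    received = {n: 0 for n in adj}
    seen = {source}                 # per-node caches collapsed into one set
    queue = deque([(source, None)])
    while queue:
        node, came_from = queue.popleft()
        for nb in adj[node]:
            if nb == came_from:
                continue
            received[nb] += 1       # a copy arrives even if it is a duplicate
            if nb not in seen:
                seen.add(nb)
                queue.append((nb, node))
    return received

# Four nodes in a ring: every node is reached, but two of them get two copies.
ring = {0: [1, 3], 1: [0, 2], 2: [1, 3], 3: [2, 0]}
copies = naive_flood(ring, 0)
```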

A more efficient flooding solution would be to exploit the coordinate space 
structure of the CAN as follows: 

Assume that our CAN is a d-dimensional CAN with dimensions 1 ... d.
Individual nodes thus have at least 2d neighbours; 2 per dimension with one to
move forward and another to move in reverse along each dimension, i.e. for every
dimension i a node has at least one neighbour whose zone abuts its own in
the forward direction along i and another neighbour whose zone abuts its own
in the reverse direction along i. For example, consider node A in Figure 4: node
B abuts A in the reverse direction along dimension 1 while nodes C and D abut
A in the forward direction along dimension 1.

Messages are then forwarded as follows: 

1. The source node (i.e. the node that generates a new message) forwards a
message to all its neighbours.

2. A node that receives a message from a neighbour with which it abuts along
dimension i forwards the message to those neighbours with which it abuts
along dimensions 1 ... (i - 1) and the neighbours with which it abuts along
dimension i on the opposite side to that from which it received the message.
Figure 4 depicts this directed flooding algorithm for a 2-dimensional CAN.

3. A node does not forward a message along a particular dimension if that
message has already traversed at least half-way across the space from the
source coordinates along that dimension. This rule prevents the flooding
from looping round the back of the space.

4. A node caches the sequence numbers of messages it has received and does
not forward a message that it has already previously received.
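The four rules can be exercised on an idealised CAN in which the space is a k x ... x k torus of equal zones, so each node is a grid cell with exactly 2d neighbours. The sketch below is an illustration, not the authors' simulator. One detail is an assumption: the half-way rule is given an asymmetric tie-break (the forward direction covers offsets up to k//2, the reverse direction up to (k-1)//2) so that the two directions do not both deliver to the cell opposite the source. Dimensions are indexed from 0 in the code.

```python
from collections import deque
from itertools import product

def directed_flood(k, d, source):
    """Flood one message over a k^d torus grid; return copies received per node."""
    received = {node: 0 for node in product(range(k), repeat=d)}
    seen = {source}                      # rule 4: cache of already-seen messages
    queue = deque()

    def send(node, dim, step):
        """Forward across `dim` in direction `step` (+1 or -1) if rule 3 allows."""
        nxt = list(node)
        nxt[dim] = (node[dim] + step) % k
        nxt = tuple(nxt)
        # Distance of the receiver from the source along dim, in travel direction.
        offset = ((nxt[dim] - source[dim]) * step) % k
        limit = k // 2 if step == 1 else (k - 1) // 2   # assumed tie-break
        if offset == 0 or offset > limit:
            return                       # rule 3: at least half-way round already
        received[nxt] += 1
        if nxt not in seen:
            seen.add(nxt)
            queue.append((nxt, dim, step))

    # Rule 1: the source forwards to all of its neighbours.
    for dim in range(d):
        for step in (1, -1):
            send(source, dim, step)

    while queue:
        node, dim, step = queue.popleft()
        # Rule 2: forward along all lower dimensions in both directions ...
        for lower in range(dim):
            for s in (1, -1):
                send(node, lower, s)
        # ... and onward along the arrival dimension, away from the sender.
        send(node, dim, step)
    return received

copies = directed_flood(4, 2, (0, 0))
```

On such a perfectly partitioned space each node other than the source has exactly one delivery path (along the lowest dimension in which it differs from its upstream node), so no duplicates arise.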

For a perfectly partitioned (i.e. where nodes have equal sized zones)
coordinate space, the above algorithm ensures that every node receives a message
exactly once. For imperfectly partitioned spaces however, a node might receive
the same message from more than one neighbour. For example, in Figure 4, node
E would receive a message from both neighbours C and D.

Certain duplicates can be easily avoided because, under normal CAN operation,
every node knows the zone coordinates for each of its neighbours. For
example, consider once more Figure 4: nodes C and D both know each other's
and node E's zone coordinates and could hence use a deterministic rule such that
only one of them forwards messages to E. Such a rule, however, only eliminates
duplicates that arise by flooding along the first dimension. The rule works along
the first dimension because all nodes forward along the first dimension. Hence
even if a node, by applying some deterministic rule, does not forward a message
to its neighbour along the first dimension, we know that some other node that
does satisfy the deterministic rule will do so. But this need not be the case when
forwarding along higher dimensions. Consider a 3-dimensional CAN; if a node
by the application of a deterministic rule decides not to forward to a neighbour
along the second dimension, there is no guarantee that any node will eventually
forward it up along the second dimension because the node that does satisfy
the deterministic rule might receive the packet along the first dimension and
hence will not forward the message along the second dimension.^4 For example,
in Figure 4, let us assume that node A decides (by the use of some deterministic
rule) not to forward to node F. Because node C receives the message (from A)
along the first dimension, it will not forward the message along the second
dimension either and hence node F, and the other nodes with Y-axis coordinates
in the same range as F, will never receive the message. While the above strategy
does not eliminate all duplicates, it does eliminate a large fraction of them
because most of the flooding occurs along the first dimension. Hence, we augment
the above flooding algorithm with the following deterministic rule used to
eliminate duplicates that arise from forwarding along the first dimension:

- Let us assume that a node, P, received a message along dimension 1 and that
node Q abuts P along dimension 1 in the opposite direction from which P
received the message. Consider the corner C_Q of Q's zone that abuts P along
dimension 1 and has the lowest coordinates along dimensions 2 ... d. Then,
P only forwards the message on to Q if P is in contact with the corner C_Q.

^4 By the second rule in the flooding algorithm.

So, for example, in Figure 4, with respect to nodes C and D, the corner under
consideration for node E would be the lower, leftmost corner of E’s zone. Hence
only D (and not C) would forward messages to E in the forward direction along
the first dimension.
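The corner rule reduces to a simple interval test, sketched below. The sketch rests on assumptions: zones are tuples of half-open (lo, hi) intervals with index 0 standing for dimension 1, P and Q are already known to abut along dimension 1, and the wrap-around of the torus is ignored. The coordinates given to C, D and E are hypothetical and were chosen only to reproduce the outcome described in the text; they are not taken from the figure.

```python
def forwards_to(p, q):
    """True if P is in contact with the corner C_Q of Q's zone.

    C_Q lies on the face of Q that abuts P along dimension 1 and has Q's lowest
    coordinate along every other dimension, so P touches it exactly when Q's
    lower bound falls inside P's span in each of dimensions 2 ... d.
    """
    return all(p_lo <= q_lo < p_hi
               for (p_lo, p_hi), (q_lo, _q_hi) in zip(p[1:], q[1:]))

# Hypothetical layout: C sits above D, and both abut the taller zone E.
D = ((0.5, 0.75), (0.0, 0.5))
C = ((0.5, 0.75), (0.5, 1.0))
E = ((0.75, 1.0), (0.0, 1.0))
```

Because the intervals are half-open, at most one of the nodes that abut Q can contain Q's lower bound in every remaining dimension, so exactly one of them forwards.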

For the above flooding algorithm, we measured through simulation the per- 
centage of nodes that experienced different degrees of message duplication caused 
by imperfectly partitioned spaces. Figure 5 plots the number of nodes that
received a particular number of duplicate messages for a system with 16,384 nodes
using CANs with dimensions ranging from 2 to 6. In all cases, over 97% of the 
nodes receive no duplicate messages and amongst those nodes that do, virtually 
all of them receive only a single duplicate message. This is a considerable im- 
provement over the naive flooding algorithm wherein every node might receive 
a number of duplicates up to the degree (number of neighbours) of the node. 

It is worth noting that the naive flooding algorithm is very robust to message 
loss because a node can receive a message via any of its neighbours. However, 
the efficient flooding algorithm is less robust because the loss of a single mes- 
sage results in the breakdown of message delivery to several subsequent nodes 
thus requiring additional loss recovery techniques. This problem is, however, no
different from the case of traditional IP multicast or other application-level
schemes where the loss of a packet along a single link results in the packet being 
lost by all downstream nodes in the distribution tree. With both flooding algo- 
rithms, the duplication of messages arises because we do not (unlike most other 
solutions to multicast delivery) construct a single spanning tree rooted at the 
source of traffic. However, we believe that the simplicity and scalability gained 
by not having to run routing algorithms to construct and maintain such delivery 
trees is well worth the slight inefficiencies that may arise from the duplication 
of messages. 

Using the above flooding algorithm, any group member can multicast a message
to the entire group. Nodes that are not group members can also multicast
to the entire group by first discovering a random group member and relaying the
transmission through this random group member.^5 This random member node
can be discovered by contacting the bootstrap node associated with the group
name.

^5 Note that relaying in our case is different from relayed transmissions as done in
source specific multicast [?] because only transmissions from non-member nodes
are relayed and even these can be relayed through any member node.

4 Performance Evaluation

Fig. 6. Cumulative Distribution of RDP.

Fig. 7. RDP versus Physical Delay for Every Group Member.

Fig. 9. Cumulative Distribution of RDP Averaged over 100 Traffic Sources.

In this section, we evaluate, through simulation, the performance of our
CAN-based multicast scheme. We adopt the performance metrics and evaluation
strategy used in [?]. As with previous evaluation studies of application-level
multicast schemes [?], we compare the performance of CAN-based multicast to
native IP multicast and naive unicast-based multicast where the source simply
unicasts a message to every receiver in succession. Our evaluation metrics are:

- Relative Delay Penalty (RDP): the ratio of the delay between two nodes
(in this case, the source node and a receiver) using CAN-based multicast to
the unicast delay between them on the underlying physical network.

- Link Stress: the number of identical copies of a packet carried by a physical
link.
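Both metrics are straightforward to compute from simulation output, as the following sketch shows. The data layout is an assumption made for illustration: delays are in milliseconds, each copy of a packet is described by the physical path it takes (a list of router or host ids), and links are treated as undirected. The function names are hypothetical.

```python
from collections import Counter

def relative_delay_penalty(overlay_delay_ms: float, unicast_delay_ms: float) -> float:
    """RDP: delay over the CAN overlay divided by the direct unicast delay."""
    return overlay_delay_ms / unicast_delay_ms

def link_stress(paths):
    """Count how many copies of one packet cross each physical link.

    `paths` is a list of physical paths, one per transmitted copy.
    """
    stress = Counter()
    for path in paths:
        for a, b in zip(path, path[1:]):
            stress[frozenset((a, b))] += 1   # undirected link a-b
    return stress

# Naive unicast from s to receivers a and b behind the same router r1:
# the shared link s-r1 carries two identical copies.
stress = link_stress([["s", "r1", "a"], ["s", "r1", "b"]])
```

A receiver that is physically very close to the source has a small denominator, so even a modest overlay delay yields a large RDP.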

Our simulations were performed on Transit-Stub (TS) topologies using the 
GT-ITM topology generator [?]. TS topologies model networks using a 2-level
hierarchy of routing domains with transit domains that interconnect lower level
stub domains.

4.1 Relative Delay Penalty

We first present results from a multicast transmission using a single source as 
this represents the performance typically seen across the different receiver nodes 
for a transmission from a single source. These simulations were performed using 
a CAN with 6 dimensions and a group size of 8192 nodes. The source node was 
selected at random. We used Transit-Stub topologies with link latencies of 20ms 
for intra-transit domain links, 5ms for stub-transit links and 2ms for intra-stub 
domain links. 

Both IP multicast and unicast-based multicast achieve an RDP value of one
for all group members because messages are transmitted along the direct physical
(IP-level) path between the source and receivers. Routing on an overlay network
however, fundamentally results in higher delays. Figure 6 plots the cumulative
distribution of RDP over the group members. While the majority of receivers
see an RDP of less than about 5 or 6, a few group members have a high RDP.
This can be explained^6 from the scatter-plot in Figure 7. The figure plots the
relation between the RDP observed by a receiver and its distance from the source
on the underlying IP-level, physical network. Each point in Figure 7 indicates
the existence of a receiver with the corresponding RDP and IP-level delay. As
can be seen, all the nodes with high values of RDP have a low physical delay to
the source, i.e. the very low delay from these receivers to the source inflates their
RDP. However, the absolute value of their delay from the source on the CAN
overlay is not really very high. This can be seen from Figure 8, which plots, for
every receiver, its delay from the source using CAN multicast versus its physical
network delay. The plot shows that while the maximum physical delay can be
about 100ms, the maximum delay using CAN-multicast is about 600ms and the
receivers on the left hand side of the graph, which had the high RDP, experience
delays of not more than 300ms.





Fig. 10. RDP versus Increasing Group Size.

Fig. 11. Number of Physical Links with a Given Stress.



The authors in 0 make the same observation and explanation. 
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The above results were all for a single multicast transmission using a single 
source; Figure 0 plots the cumulative distribution of the RDP with the delays 
averaged over multicast transmissions from a 100 sources selected at random. 
Because a node is unlikely to be very close (in terms of physical delay) to all 
100 sources, averaging the results over transmissions from many sources helps 
to reduce the appearance of inflated RDPs that occurs when a receiver is very 
close to the source. From Figure |3 we see that, on average, no node sees an 
RDP of more than about 6.0. 

Finally, Figure 10 plots the 50th and 90th percentile RDP values for group sizes 
ranging from 128 to 65,000 for a single source. We scale the group size as follows: 
we take a 1,000-node Transit-Stub topology as before and, to this topology, we 
add end-host (source and receiver) nodes to the stub (leaf) nodes in the topology. 
The delay of the link from the end-host node to the stub node is set to 1ms. 
Thus, in scaling the group size from 128 to 65K nodes, we are scaling the density 
of the graph without scaling the backbone (transit) domain. So, for example, a 
group size of 128 nodes implies that approximately one in ten stub nodes has 
an associated group member, while a group size of 65K implies that every stub 
node has approximately 65 attached end-host nodes. This method of scaling the 
graph causes the flat trend in the growth of RDP with group size because, for 
a given source, the relative number of close-by and distant nodes stays roughly 
constant. Further, at high density, every CAN node has increasingly many 
close-by nodes and hence the CAN binning technique used to cluster co-located 
nodes yields higher gains. Different methods for scaling topologies could yield 
different scaling trends. 

While the significant differences between End-System Multicast and CAN- 
based multicast make it hard to draw any direct comparison between the two 
systems, Figure mi indicates that the performance of CAN-based multicast, even 
for small group sizes, is competitive with End-System Multicast. 

4.2 Link Stress 

Ideally, one would like the stress on the different physical links to be somewhat 
evenly distributed. Using native IP multicast, every link in the network has a 
stress of exactly one. In the case of unicasting from the source directly to all 
the receivers, links close to the source node have very high stress (equal to the 
group size at the first-hop link from the source). Figure 11 plots the number 
of nodes that experienced a particular stress value for a group size of 1024 for 
a 6-dimensional CAN. Unlike naive unicast where a small number of links see 
extremely high stress, CAN-based multicast distributes the stress much more 
evenly over all the links. 
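
To make the stress metric concrete, the sketch below counts how many copies of one packet cross each physical link, given the physical path taken by every individual transmission. It assumes stress is defined as that copy count; the function names and the tiny three-receiver topology are hypothetical.

```python
from collections import Counter


def link_stress(paths):
    """Count the copies of a single packet carried by each physical link.

    `paths` holds one node sequence per transmission (one per unicast copy,
    or one per overlay hop); links are treated as undirected.
    """
    stress = Counter()
    for path in paths:
        for a, b in zip(path, path[1:]):
            stress[frozenset((a, b))] += 1
    return stress


def stress_histogram(stress):
    """Map each stress value to the number of links that experience it."""
    return Counter(stress.values())


# Naive unicast from source S through router R to receivers A, B and C.
unicast = link_stress([["S", "R", "A"], ["S", "R", "B"], ["S", "R", "C"]])
histogram = stress_histogram(unicast)
```

Here the first-hop link S-R carries three copies, matching the observation that naive unicast concentrates stress near the source.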

Figure O plots the worst-case stress for group sizes ranging from 128 to 
65,000 nodes. The high stress in the case of large group sizes is because, as 
described earlier, we scale the group size without scaling the size of the backbone 
topology. For the above simulation, we used a transit-stub topology with 1,000 
nodes. Hence, for a group size of 65,000 nodes, all 65,000 nodes are interconnected 
by a backbone topology of fewer than 1,000 nodes, thus putting high stress on some 
backbone links. We repeated the above simulation for a transit-stub topology 
with 10,000 nodes, thus decreasing the density of the graph by a factor of 10. 
Figure ESI plots the worst-case stress for group sizes up to 2,048 nodes for all 
three cases (i.e., CAN-based multicast using Transit-Stub topologies with 1,000 
and 10,000 nodes, and naive unicast-based multicast). As can be seen, at lower 
density the worst-case stress drops sharply. For example, at 2,048 nodes the 
worst-case stress drops from 169 (for TS1000) to 37 (for TS10000). Because, in 
practice, we do not expect very high densities of group member nodes relative to 
the Internet topology itself, worst-case stress using CAN-based multicast should 
be at a reasonable level. In future work, we intend to look into techniques that 
might further lower this stress value. 
5 Related Work 

The case for application-level multicast as a more tractable alternative to a 
network-level multicast service was first put forth in 

The End-system multicast p[j work proposes an architecture for multicast 
over small and sparse groups. End-system multicast builds a mesh structure 
across participating end-hosts and then constructs source-rooted trees by run- 
ning a routing protocol over this mesh. The authors also study the fundamental 
performance penalty associated with such an application-level model. The au- 
thors in fP argue for infrastructure support to tackle the problem of content 
distribution over the Internet. The Scattercast architecture relies on proxies de- 
ployed within the network infrastructure. These proxies self-organise into an 
application-level mesh over which a global routing algorithm is used to construct 
distribution trees. In terms of being a solution to application-level multicast, the 
key difference between our work and the End-System multicast and Scattercast 
work is the potential for CAN-based multicast to scale to large group sizes. 
Yoid 0 proposes a solution to application-level multicast wherein a span- 
ning tree is directly constructed across the participating nodes without first 
constructing a mesh structure. The resultant protocols are more complex be- 
cause the tree-first approach results in expensive loop detection and avoidance 
techniques and must be made resilient to partitions. 

Tapestry m is a wide-area overlay routing and location infrastructure that, 
like CANs, embeds nodes in a well-defined virtual address space. Bayeux El is 
a source-specific, application-level multicast scheme that leverages the Tapestry 
routing infrastructure. To join a multicast session, Bayeux nodes send JOIN 
messages to the source node. The source replies to a JOIN request by rout- 
ing a TREE message, on the Tapestry overlay, to the requesting node. This 
TREE message is used to set up state at intermediate nodes along the path 
from the source node to the new member. Similarly, a LEAVE message from an 
existing member triggers a PRUNE message from the root, which removes the 
appropriate forwarding state along the distribution tree. Bayeux and CAN-based 
multicast are similar in that they achieve scalability by leveraging the scalable 
routing infrastructure provided by systems like CAN and Tapestry. In terms 
of service model, Bayeux fundamentally supports only source-specific multicast 
while CAN-based multicast allows any group member to act as a traffic source. 
In terms of design, Bayeux uses an explicit protocol to set up and tear down a 
distribution tree from the source node to the current set of receiver nodes. CAN- 
based multicast, by contrast, fully exploits the CAN structure, because of which 
messages can be forwarded without requiring a routing protocol to explicitly 
construct distribution trees. 

Overcast jS) is a scheme for source-specific, reliable multicast using an overlay 
network. Overcast constructs efficient dissemination trees rooted at the single 
source of traffic. The overlay network in Overcast is composed of nodes that 
reside within the network infrastructure. This assumption of the existence of 
permanent storage within the network distinguishes Overcast from CANs and 
indeed, from most of the other systems described above. Unlike Overcast, CANs 
can be composed entirely from end-user machines with no form of central au- 
thority. 

6 Conclusion 

Content-Addressable Networks have the potential to serve as an infrastructure 
that is useful across a range of applications. In this paper, we present and evalu- 
ate a scheme that extends the basic CAN framework to support application-level 
multicast delivery. There are, we believe, two key benefits to CAN-based multi- 
cast: the potential to scale to large groups without restricting the service model 
and the simplicity of the scheme under the assumption of the deployment of a 
distributed infrastructure such as a Content-Addressable Network. 

Our CAN-based multicast scheme is optimal in terms of the distance (in 
terms of path length) in flooding messages over the CAN overlay structure itself. 
In future work, we intend to look into simple clustering techniques to further 
reduce the link stress caused by our flooding algorithm and to understand what 
the fundamental limitations are. A number of important questions such 
as security, loss recovery, and congestion control remain to be addressed in the 
context of CAN-based multicast. 
Abstract. This paper presents Scribe, a large-scale event notification 
infrastructure for topic-based publish-subscribe applications. Scribe sup- 
ports large numbers of topics, with a potentially large number of sub- 
scribers per topic. Scribe is built on top of Pastry, a generic peer-to- 
peer object location and routing substrate overlaid on the Internet, 
and leverages Pastry’s reliability, self-organization and locality proper- 
ties. Pastry is used to create a topic (group) and to build an efficient 
multicast tree for the dissemination of events to the topic’s subscribers 
(members). Scribe provides weak reliability guarantees, but we outline 
how an application can extend Scribe to provide stronger ones. 



1 Introduction 

Publish-subscribe has emerged as a promising paradigm for large-scale, Internet- 
based distributed systems. In general, subscribers register their interest in a topic 
or a pattern of events and then asynchronously receive events matching their in- 
terest, regardless of the events’ publisher. Topic-based publish-subscribe ITEEl 
is very similar to group-based communication; subscribing is equivalent to be- 
coming a member of a group. For such systems the challenge remains to build 
an infrastructure that can scale to, and tolerate the failure modes of the general 
Internet. 

Techniques such as SRM (Scalable Reliable Multicast Protocol) Pj or RMTP 
(Reliable Multicast Transport Protocol) [5] have added reliability to network-level 
IP multicast m solutions. However, tracking membership remains an issue in 
router-based multicast approaches and the lack of wide deployment of IP multi- 
cast limits their applicability. As a result, application-level multicast is gaining 
popularity. Appropriate algorithms and systems for scalable subscription man- 
agement and scalable, reliable propagation of events are still an active research 
area IHIhllOlllj . 

Recent work on peer-to-peer overlay networks offers a scalable, self-organizing, 
fault-tolerant substrate for decentralized distributed applications |i2ii;iii4iib| . 



J. Crowcroft and M. Hofmann (Eds.): NGC 2001, LNCS 2233, pp. 30-|^| 2001. 
Such systems offer an attractive platform for publish-subscribe systems that can 
leverage these properties. In this paper we present Scribe, a large-scale, decentral- 
ized event notification infrastructure built upon Pastry, a scalable, self-organizing 
peer-to-peer location and routing substrate with good locality properties m 
Scribe provides efficient application-level multicast and is capable of scaling to 
a large number of subscribers, publishers and topics. 

Scribe and Pastry adopt a fully decentralized peer-to-peer model, where 
each participating node has equal responsibilities. Scribe builds a multicast tree, 
formed by joining the Pastry routes from each subscriber to a rendez-vous point 
associated with a topic. Subscription maintenance and publishing in Scribe lever- 
age the robustness, self-organization, locality and reliability properties of Pastry. 
Section 2 gives an overview of the Pastry routing and object location infrastruc- 
ture. Section 3 describes the basic design of Scribe and we discuss related work 
in Section 0 

2 Pastry 

In this section we briefly sketch Pastry m Pastry forms a secure, robust, self- 
organizing overlay network in the Internet. Any Internet-connected host that 
runs the Pastry software and has proper credentials can participate in the overlay 
network. 

Each Pastry node has a unique, 128-bit nodeId. The set of existing nodeIds 
is uniformly distributed; this can be achieved, for instance, by basing the nodeId 
on a secure hash of the node’s public key or IP address. Given a message and 
a key, Pastry reliably routes the message to the Pastry node with a nodeId 
that is numerically closest to the key, among all live Pastry nodes. Assuming a 
Pastry network consisting of $N$ nodes, Pastry can route to any node in less than 
$\lceil \log_{2^b} N \rceil$ steps on average ($b$ is a configuration parameter with typical value 
4). With concurrent node failures, eventual delivery is guaranteed unless $\lfloor l/2 \rfloor$ 
nodes with adjacent nodeIds fail simultaneously ($l$ is a configuration parameter 
with typical value 16). 

The tables required in each Pastry node have only $(2^b - 1) \cdot \lceil \log_{2^b} N \rceil + 2l$ 
entries, where each entry maps a nodeId to the associated node’s IP address. 
Moreover, after a node failure or the arrival of a new node, the invariants in 
all affected routing tables can be restored by exchanging $O(\log_{2^b} N)$ messages. 
In the following, we briefly sketch the Pastry routing scheme. A full description 
and evaluation of Pastry can be found in [12]. 
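
As a worked instance of the two bounds above, the sketch below evaluates them for a network of one million nodes with the typical parameter values quoted in the text (b = 4, l = 16). The function names are hypothetical, and the logarithm is computed with integers to avoid floating-point rounding at exact powers.

```python
def ceil_log(n, base):
    """Smallest integer k such that base**k >= n, computed without floats."""
    k, power = 0, 1
    while power < n:
        power *= base
        k += 1
    return k


def routing_step_bound(n_nodes, b=4):
    """Bound on the average number of routing steps: ceil(log_{2^b} N)."""
    return ceil_log(n_nodes, 2 ** b)


def table_entries(n_nodes, b=4, l=16):
    """Number of table entries per node: (2^b - 1) * ceil(log_{2^b} N) + 2l."""
    return (2 ** b - 1) * routing_step_bound(n_nodes, b) + 2 * l


steps = routing_step_bound(10 ** 6)   # 16**5 is the first power >= 10**6
entries = table_entries(10 ** 6)      # 15 * 5 + 32
```

Under these assumptions a million-node overlay needs 5 routing steps at most on average and 107 table entries per node.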

For the purposes of routing, nodeIds and keys are thought of as a sequence of 
digits with base $2^b$. A node’s routing table is organized into $\lceil \log_{2^b} N \rceil$ rows with 
$2^b - 1$ entries each. The $2^b - 1$ entries in row $n$ of the routing table each refer to 
a node whose nodeId matches the present node’s nodeId in the first $n$ digits, but 
whose $(n+1)$th digit has one of the $2^b - 1$ possible values other than the $(n+1)$th 
digit in the present node’s id. The uniform distribution of nodeIds ensures an 
even population of the nodeId space; thus, only $\lceil \log_{2^b} N \rceil$ levels are populated 
in the routing table. Each entry in the routing table refers to one of potentially 
Fig. 1. State of a hypothetical Pastry node with nodeId 10233102, $b = 2$. All 
numbers are in base 4. The top row of the routing table represents level zero. 
The neighborhood set is not used in routing, but is needed during node addi- 
tion/recovery. 

many nodes whose nodeIds have the appropriate prefix. Among such nodes, the 
one closest to the present node (according to a scalar proximity metric, such as 
the delay or the number of IP routing hops) is chosen in practice. 

In addition to the routing table, each node maintains IP addresses for the 
nodes in its leaf set, i.e., the set of nodes with the $l/2$ numerically closest larger 
nodeIds, and the $l/2$ nodes with numerically closest smaller nodeIds, relative to 
the present node’s nodeId. Figure 1 depicts the state of a hypothetical Pastry 
node with the nodeId 10233102 (base 4), in a system that uses 16-bit nodeIds 
and a value of $b = 2$. 

In each routing step, a node normally forwards the message to a node whose 
nodeId shares with the key a prefix that is at least one digit (or $b$ bits) longer 
than the prefix that the key shares with the present node’s id. If no such node 
is found in the routing table, the message is forwarded to a node whose nodeId 
shares a prefix with the key as long as the current node, but is numerically closer 
to the key than the present node’s id. Such a node must be in the leaf set unless 
the message has already arrived at the node with numerically closest nodeId 
or its neighbor. And, unless $\lfloor l/2 \rfloor$ adjacent nodes in the leaf set have failed 
simultaneously, at least one of those nodes must be live. 
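
The routing step just described can be sketched as follows. This is a simplification under stated assumptions, not Pastry's implementation: the routing table and leaf set are flattened into one list `known_ids`, ids are equal-length digit strings, numerical distance ignores wrap-around in the id space, and ties are broken by numerical closeness rather than by the proximity metric. All names are hypothetical.

```python
def shared_prefix_len(a, b):
    """Number of leading digits that two equal-length id strings share."""
    count = 0
    for x, y in zip(a, b):
        if x != y:
            break
        count += 1
    return count


def next_hop(key, node_id, known_ids, base=4):
    """Choose the next node for `key`, or None if the message stays here."""
    p = shared_prefix_len(key, node_id)

    def distance(n):
        return abs(int(n, base) - int(key, base))

    # Rule 1: prefer a node sharing a strictly longer prefix with the key.
    longer = [n for n in known_ids if shared_prefix_len(key, n) > p]
    if longer:
        return min(longer, key=distance)
    # Rule 2: otherwise a node with an equally long prefix that is
    # numerically closer to the key than the present node.
    closer = [n for n in known_ids
              if shared_prefix_len(key, n) == p
              and distance(n) < distance(node_id)]
    if closer:
        return min(closer, key=distance)
    return None
```

For example, `next_hop("1031", "1023", ["1030", "1100", "2000"])` selects `"1030"` by the first rule, while `next_hop("1022", "1020", ["1021", "2000"])` falls back to the second rule and selects `"1021"`.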



2.1 Locality 

Next, we discuss Pastry’s locality properties, i.e., the properties of Pastry’s routes 
with respect to the proximity metric. The proximity metric is a scalar value that 
reflects the “distance” between any pair of nodes, such as the number of IP 
routing hops, geographic distance, delay, or a combination thereof. It is assumed 
that a function exists that allows each Pastry node to determine the “distance” 
between itself and a node with a given IP address. 
We limit our discussion to two of Pastry’s locality properties that are relevant 
to Scribe. The first property is the total distance, in terms of the proximity 
metric, that messages travel along Pastry routes. Recall that each entry 
in the node routing tables is chosen to refer to the nearest node, according to the 
proximity metric, with the appropriate nodeId prefix. As a result, in each step 
a message is routed to the nearest node with a longer prefix match. Simulations 
show that, given a network topology based on the Georgia Tech model, the 
average distance traveled by a message is less than 66% higher than the distance 
between the source and destination in the underlying Internet. 

Let us assume that two nodes within distance d from each other route mes- 
sages with the same key, such that the distance from each node to the node with 
nodeId closest to the key is much larger than d. The second locality property is 
concerned with the “distance” the messages travel until they reach a node where 
their routes merge. Simulations show that the average distance traveled by each 
of the two messages before their routes merge is approximately equal to the 
distance between their respective source nodes. These properties have a strong 
impact on the locality properties of the Scribe multicast trees, as explained in 
Section 0 

2.2 Node Addition and Failure 

A key design issue in Pastry is how to efficiently and dynamically maintain the 
node state, i.e., the routing table, leaf set and neighborhood sets, in the presence 
of node failures, node recoveries, and new node arrivals. The protocol is described 
and evaluated in m 

Briefly, an arriving node with the newly chosen nodeId X can initialize its 
state by contacting a nearby node A (according to the proximity metric) and 
asking A to route a special message using X as the key. This message is routed 
to the existing node Z with nodeId numerically closest to X. X then obtains 
the leaf set from Z, the neighborhood set from A, and the ith row of the routing 
table from the ith node encountered along the route from A to Z. One can show 
that using this information, X can correctly initialize its state and notify nodes 
that need to know of its arrival, thereby restoring all of Pastry’s invariants. 

To handle node failures, neighboring nodes in the nodeId space (which are 
aware of each other by virtue of being in each other’s leaf set) periodically 
exchange keep-alive messages. If a node is unresponsive for a period T, it is 
presumed failed. All members of the failed node’s leaf set are then notified and 
they update their leaf sets to restore the invariant. Since the leaf sets of nodes 
with adjacent nodeIds overlap, this update is trivial. A recovering node contacts 
the nodes in its last known leaf set, obtains their current leaf sets, updates its 
own leaf set and then notifies the members of its new leaf set of its presence. 
Routing table entries that refer to failed nodes are repaired lazily; the details 
are described in [12]. 
2.3 Pastry API 

In this section, we briefly describe the application programming interface (API) 
exported by Pastry which is used in the Scribe implementation. The presented 
API is slightly simplified for clarity. Pastry exports the following operations: 

route(msg,key) causes Pastry to route the given message to the node with 
nodeId numerically closest to key, among all live Pastry nodes. 

send(msg,IP-addr) causes Pastry to send the given message to the node with 
the specified IP address, if that node is live. The message is received by that 
node through the deliver method. 

Applications layered on top of Pastry must export the following operations: 

deliver(msg,key) called by Pastry when a message is received and the local 
node’s nodeId is numerically closest to key among all live nodes, or when a 
message is received that was transmitted via send, using the IP address of the 
local node. 

forward(msg,key,nextId) called by Pastry just before a message is forwarded 
to the node with nodeId = nextId. The application may change the contents of 
the message or the value of nextId. Setting the nextId to NULL will terminate 
the message at the local node. 
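
One way to picture the application side of this interface is as an abstract class that a Pastry-like substrate would call back into. The sketch below is an assumption-laden illustration, not the real Pastry API: the class names are invented, and `forward` returns the next hop (or `None` to terminate the message) because Python cannot reassign a caller's argument in place.

```python
from abc import ABC, abstractmethod


class PastryApplication(ABC):
    """Callbacks an application exports to the (hypothetical) substrate."""

    @abstractmethod
    def deliver(self, msg, key):
        """Called when this node is numerically closest to key, or on send."""

    @abstractmethod
    def forward(self, msg, key, next_id):
        """Called before forwarding; return the next hop or None to stop."""


class PassThroughApp(PastryApplication):
    """Trivial application that records deliveries and never reroutes."""

    def __init__(self):
        self.delivered = []

    def deliver(self, msg, key):
        self.delivered.append((msg, key))

    def forward(self, msg, key, next_id):
        return next_id  # leave the routing decision unchanged
```

Scribe, described next, is an application of this kind: it overrides both callbacks to maintain its multicast trees.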

In the following section, we will describe how Scribe is layered on top of the 
Pastry API. Other applications built on top of Pastry include PAST, a persistent, 
global storage utility jlYllSj . 

3 Scribe 

Any Scribe node may create a topic; other nodes can then register their interest 
in the topic and become subscribers to the topic. Any Scribe node with the 
appropriate credentials for the topic can then publish events, and Scribe dissem- 
inates these events to all the topic’s subscribers. Scribe provides best-effort 
dissemination of events, and specifies no particular event delivery order. How- 
ever, stronger reliability guarantees and ordered delivery for a topic can be built 
on top of Scribe, as outlined in Section 3.2. Nodes can publish events, create and 
subscribe to many topics, and topics can have many publishers and subscribers. 
Scribe can support large numbers of topics with a wide range of subscribers per 
topic, and a high rate of subscriber turnover. 

Scribe offers a simple API to its applications: 
create(credentials, topicId) creates a topic with topicId. Throughout, the 
credentials are used for access control. 

subscribe(credentials, topicId, eventHandler) causes the local node to 
subscribe to the topic with topicId. All subsequently received events for that 
topic are passed to the specified event handler. 

unsubscribe(credentials, topicId) causes the local node to unsubscribe from 
the topic with topicId. 

publish(credentials, topicId, event) causes the event to be published in the 
topic with topicId. 
Scribe uses Pastry to manage topic creation and subscription, and to build a per- 
topic multicast tree used to disseminate the events published in the topic. Pastry 
and Scribe are fully decentralized: all decisions are based on local information, 
and each node has identical capabilities. Each node can act as a publisher, a root 
of a multicast tree, a subscriber to a topic, a node within a multicast tree, and 
any sensible combination of the above. Much of the scalability and reliability of 
Scribe and Pastry derives from this peer-to-peer model. 

3.1 Scribe Implementation 

A Scribe system consists of a network of Pastry nodes, where each node runs 
the Scribe application software. The Scribe software on each node provides the 
forward and deliver methods, which are invoked by Pastry whenever a Scribe 
message arrives. The pseudo-code for these Scribe methods, simplified for clarity, 
is shown in Figure 2 and Figure 3, respectively. 



(1) forward(msg, key, nextId) 
(2)   switch msg.type is 
(3)     SUBSCRIBE : if !(msg.topic ∈ topics) 
(4)                   topics = topics ∪ msg.topic 
(5)                   msg.source = thisNodeId 
(6)                   route(msg, msg.topic) 
(7)                 topics[msg.topic].children ∪ msg.source 
(8)                 nextId = null // Stop routing the original message 



Fig. 2. Scribe Implementation of Forward. 



(1)  deliver(msg, key) 
(2)    switch msg.type is 
(3)      CREATE :      topics = topics ∪ msg.topic 
(4)      SUBSCRIBE :   topics[msg.topic].children ∪ msg.source 
(5)      PUBLISH :     ∀ node in topics[msg.topic].children 
(6)                      send(msg, node) 
(7)                    if subscribedTo(msg.topic) 
(8)                      invokeEventHandler(msg.topic, msg) 
(9)      UNSUBSCRIBE : topics[msg.topic].children = 
                         topics[msg.topic].children - msg.source 
(10)                   if (|topics[msg.topic].children| = 0) 
(11)                     msg.source = thisNodeId 
(12)                     send(msg, topics[msg.topic].parent) 


Fig. 3. Scribe Implementation of Deliver. 



Recall that the forward method is called whenever a Scribe message is routed 
through a node. The deliver method is called when a Scribe message arrives at 
the node with nodeId numerically closest to the message’s key, or when a message 
was addressed to the local node using the Pastry send operation. The possible 
message types in Scribe are SUBSCRIBE, CREATE, UNSUBSCRIBE and PUBLISH; 
the roles of these messages are described in the next sections. 

The following variables are used in the pseudocode: topics is the set of topics 
that the local node is aware of, msg.source is the nodeId of the message’s source 
node, msg.event is the published event (if present), msg.topic is the topicId of 
the topic and msg.type is the message type. 

Topic Management. Each topic has a unique topicId. The Scribe node with a 
nodeId numerically closest to the topicId acts as the rendez-vous point for the 
associated topic. The rendez-vous point forms the root of a multicast tree created 
for the topic. 

To create a topic, a Scribe node asks Pastry to route a CREATE message 
using the topicId as the key (e.g. route(CREATE,topicId)). Pastry delivers this 
message to the node with the nodeId numerically closest to topicId. The Scribe 
deliver method adds the topic to the list of topics it already knows about (line 
3 of Figure 3). It also checks the credentials to ensure that the topic can be 
created, and stores the credentials in the topics set. This Scribe node becomes 
the rendez-vous point for the topic. 

The topicId is the hash of the topic’s textual name concatenated with its 
creator’s name. The hash is computed using a collision-resistant hash function 
(e.g. SHA-1), which ensures a uniform distribution of topicIds. Since Pastry 
nodeIds are also uniformly distributed, this ensures an even distribution of top- 
ics across Pastry nodes. A topicId can be generated by any Scribe node using 
only the textual name of the topic and its creator, without the need for an ad- 
ditional naming service. Of course, proper credentials are necessary to subscribe 
or publish in the associated topic. 
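
The topicId derivation can be sketched with Python's standard `hashlib`. Two details here are assumptions rather than statements from the text: the inputs are encoded as UTF-8 before hashing, and the 160-bit SHA-1 digest is truncated to its first 128 bits so that the result fits the 128-bit nodeId space. The function name is hypothetical.

```python
import hashlib


def topic_id(topic_name, creator_name):
    """Derive a 128-bit topicId from the topic's name and its creator's name."""
    material = (topic_name + creator_name).encode("utf-8")
    digest = hashlib.sha1(material).digest()
    # Assumption: keep the leading 16 bytes (128 bits) of the 20-byte digest.
    return int.from_bytes(digest[:16], "big")


same = topic_id("weather", "alice") == topic_id("weather", "alice")
```

Because the computation needs only the two names, any node can recompute the same topicId locally, which is why no separate naming service is required.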

Membership Management. Scribe creates a multicast tree, rooted at the rendez- 
vous point, to disseminate the events published in the topic. The multicast tree 
is created using a scheme similar to reverse path forwarding m- The tree is 
formed by joining the Pastry routes from each subscriber to the rendez-vous 
point. Subscriptions to a topic are managed in a decentralized manner to support 
large and dynamic sets of subscribers. 

Scribe nodes that are part of a topic’s multicast tree are called forwarders 
with respect to the topic; they may or may not be subscribers to the topic. Each 
forwarder maintains a children table for the topic containing an entry (IP address 
and nodeId) for each of its children in the multicast tree. 

When a Scribe node wishes to subscribe to a topic, it asks Pastry to route 
a SUBSCRIBE message with the topic’s topicId as the key (e.g. 
route(SUBSCRIBE,topicId)). This message is routed by Pastry towards the topic’s rendez- 
vous point. At each node along the route, Pastry invokes Scribe’s forward method. 
Forward (lines 3 to 8 in Figure 2) checks its list of topics to see if it is currently 
a forwarder; if so, it accepts the node as a child, adding it to the children table. 
If the node is not already a forwarder, it creates an entry for the topic, and adds 
the source node as a child in the associated children table. It then becomes a 
forwarder for the topic by sending a SUBSCRIBE message to the next node along 
the route from the original subscriber to the rendez-vous point. The original 
message from the source is terminated; this is achieved by setting nextId = null, 
in line 8 of Figure 2. 

Figure 4 illustrates the subscription mechanism. The circles represent nodes, 
and some of the nodes have their nodeId shown. For simplicity $b = 1$, so the 
prefix is matched one bit at a time. We assume that there is a topic with topicId 
1100 whose rendez-vous point is the node with the same identifier. The node 
with nodeId 0111 is subscribing to this topic. In this example, Pastry routes the 
SUBSCRIBE message to node 1001; then the message from 1001 is routed to 1101; 
finally, the message from 1101 arrives at 1100. This route is indicated by the 
solid arrows in Figure 4. 




Fig. 4. Base Mechanism for Subscription and Multicast Tree Creation. 

Let us assume that nodes 1001 and 1101 are not already forwarders for topic 
1100. The subscription of node 0111 causes the other two nodes along the route 
to become forwarders for the topic, and causes them to add the preceding node 
in the route to their children tables. Now let us assume that node 0100 decides 
to subscribe to the same topic. The route that its SUBSCRIBE message would 
take is shown using dot-dash arrows. Since node 1001 is already a forwarder, it 
adds node 0100 to its children table for the topic, and the SUBSCRIBE message 
is terminated. 
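
The example above can be replayed with a small simulation. This is a sketch only: the Pastry routes are supplied by hand as lists of nodeIds instead of being computed, the rendez-vous point is assumed to hold an (empty) entry for the topic from topic creation, and the function name is hypothetical.

```python
def subscribe(children, route):
    """Propagate a SUBSCRIBE along `route` (subscriber first, root last).

    `children` maps each forwarder's nodeId to its set of children for a
    single topic and is updated in place.
    """
    child = route[0]
    for node in route[1:]:
        already_forwarder = node in children
        children.setdefault(node, set()).add(child)
        if already_forwarder:
            break          # the message is terminated at an existing forwarder
        child = node       # the new forwarder subscribes on its own behalf
    return children


tree = {"1100": set()}     # rendez-vous point created the topic earlier
subscribe(tree, ["0111", "1001", "1101", "1100"])
subscribe(tree, ["0100", "1001", "1101", "1100"])
```

After both calls, node 1001 has children 0111 and 0100, node 1101 has child 1001, and the root 1100 has child 1101; the second SUBSCRIBE never travels past 1001.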

When a Scribe node wishes to unsubscribe from a topic, it locally marks 
the topic as no longer required. If there are no entries in the children table, it 
sends an UNSUBSCRIBE message to its parent in the multicast tree, as shown 
in lines 9 to 12 in Figure 3. The message proceeds recursively up the multicast 
tree, until a node is reached that still has entries in the children table after 
removing the departing child. It should be noted that nodes in the multicast 
tree are aware of their parent’s nodeId only after they have received an event 
from their parent. Should a node wish to unsubscribe before receiving an event, 
the implementation transparently delays the unsubscription until the first event 
is received. 

The subscriber management mechanism is efficient for topics with differ- 
ent numbers of subscribers, varying from one to all Scribe nodes. The list of 
subscribers to a topic is distributed across the nodes in the multicast tree. Pas- 
try’s randomization properties ensure that the tree is well balanced and that 
the forwarding load is evenly balanced across the nodes. This balance enables 
Scribe to support large numbers of topics and subscribers per topic. Subscrip- 
tion requests are handled locally in a decentralized fashion. In particular, the 
rendez-vous point does not handle all subscription requests. 

The locality properties of Pastry (discussed in Section 2.1) ensure that the 
network routes from the root to each subscriber are short with respect to the 
proximity metric. In addition, subscribers that are close with respect to the 
proximity metric tend to be children of a parent in the multicast tree that is 
also close to them. This reduces stress on network links because the parent 
receives a single copy of the event message and forwards copies to its children 
along short routes. 

Event Dissemination. Publishers use Pastry to locate the rendez-vous point of 
a topic. If the publisher is aware of the rendez-vous point’s IP address then the 
PUBLISH message can be sent straight to the node. If the publisher does not know 
the IP address of the rendez-vous point, then it uses Pastry to route to that node 
(e.g. route(PUBLISH, topicId)), and asks the rendez-vous point to return its IP 
address to the publisher. Events are disseminated from the rendez-vous point 
along the multicast tree in the obvious way (lines 5 and 6 of Figure 3). 

The caching of the rendez-vous point’s IP address is an optimization, to avoid 
repeated routing through Pastry. If the rendez-vous point fails then the publisher 
can route the event through Pastry and discover the new rendez-vous point. If 
the rendez-vous point has changed because a new node has arrived, then the old 
rendez-vous point can forward the PUBLISH message to the new rendez-vous point 
and ask the new rendez-vous point to forward its IP address to the publisher. 

There is a single multicast tree for each topic and all publishers use the above 
procedure to publish events. This allows the rendez-vous node to perform access 
control. 
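
The caching optimization described above can be sketched as follows. This is a minimal illustration, not Scribe's implementation: the `overlay` object with its `route` and `send_direct` methods is a hypothetical stand-in for Pastry, and `FakeOverlay` exists only to exercise the sketch.

```python
class Publisher:
    """Publisher that caches the rendez-vous point's address (hypothetical API)."""

    def __init__(self, overlay):
        self.overlay = overlay   # assumed to offer route() and send_direct()
        self.cache = {}          # topic id -> cached rendez-vous address

    def publish(self, topic_id, event):
        addr = self.cache.get(topic_id)
        if addr is not None:
            try:
                # Fast path: send straight to the cached rendez-vous point.
                self.overlay.send_direct(addr, ("PUBLISH", topic_id, event))
                return addr
            except ConnectionError:
                del self.cache[topic_id]   # cached rendez-vous point has failed
        # Slow path: route through the overlay; the reply carries the address.
        addr = self.overlay.route(("PUBLISH", topic_id, event), topic_id)
        self.cache[topic_id] = addr
        return addr


class FakeOverlay:
    """Toy stand-in for Pastry with a single root per key (illustration only)."""

    def __init__(self, root):
        self.root = root
        self.alive = {root}
        self.routed = 0
        self.direct = 0

    def route(self, msg, key):
        self.routed += 1
        return self.root

    def send_direct(self, addr, msg):
        if addr not in self.alive:
            raise ConnectionError(addr)
        self.direct += 1
```

Only the first publication, and the first one after a rendez-vous failure, pay the cost of overlay routing in this sketch.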

3.2 Reliability 

Publish/subscribe applications may have diverse reliability requirements. Some 
topics may require reliable and ordered delivery of events, whilst others require 
only best-effort delivery. Therefore, Scribe provides only best-effort delivery of 
events but it offers a framework for applications to implement stronger reliability 
guarantees. 

Scribe uses TCP to disseminate events reliably from parents to their children 
in the multicast tree, and it uses Pastry to repair the multicast tree when a 
forwarder fails. 

Repairing the Multicast Tree. Periodically, each non-leaf node in the tree sends 
a heartbeat message to its children. When events are frequently published on a 
topic, most of these messages can be avoided since events serve as an implicit 
heartbeat signal. A child suspects that its parent is faulty when it fails to receive 
heartbeat messages. Upon detection of the failure of its parent, a node calls 
Pastry to route a SUBSCRIBE message to the topic’s identifier. Pastry will route 
the message to a new parent, thus repairing the multicast tree.

For example, consider the failure of node 1101 in the multicast tree shown in
the figure. Node 1001 detects the failure of 1101 and uses Pastry to route a
SUBSCRIBE message towards the root through an alternative route. The message
reaches node 1111, which adds 1001 to its children table and, since it is not a
forwarder, sends a SUBSCRIBE message towards the root. This causes node 1100 to
add 1111 to its children table.
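
The repair procedure can be sketched in a few lines. This is an illustrative sketch under stated assumptions: `route(message, key)` is a hypothetical stand-in for Pastry routing, the timeout value is arbitrary, and the special handling of the root (which does not forward subscriptions any further) is omitted.

```python
class TreeNode:
    """One node of a multicast tree; `route` stands in for Pastry (assumption)."""

    def __init__(self, node_id, route, timeout=3.0):
        self.node_id = node_id
        self.route = route        # assumed signature: route(message, key)
        self.timeout = timeout    # assumed: a few heartbeat periods
        self.children = {}        # topic id -> set of child node ids
        self.in_tree = set()      # topics for which this node is in the tree
        self.last_heard = {}      # topic id -> time of last event or heartbeat

    def join(self, topic_id):
        """Send SUBSCRIBE towards the topic's root via the overlay."""
        self.in_tree.add(topic_id)
        self.route(("SUBSCRIBE", topic_id, self.node_id), topic_id)

    def on_subscribe(self, topic_id, child_id):
        """Adopt the child; if not yet in the tree, subscribe towards the root."""
        self.children.setdefault(topic_id, set()).add(child_id)
        if topic_id not in self.in_tree:
            self.join(topic_id)

    def on_parent_traffic(self, topic_id, now):
        """Events and explicit heartbeats both count as signs of life."""
        self.last_heard[topic_id] = now

    def check_parent(self, topic_id, now):
        """Resubscribe through the overlay if the parent was silent too long."""
        if now - self.last_heard.get(topic_id, now) > self.timeout:
            self.last_heard[topic_id] = now
            self.join(topic_id)
            return True
        return False
```

In the example above, node 1111 would run `on_subscribe` when the message from 1001 arrives and, not being in the tree yet, would send its own SUBSCRIBE towards the root.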

Scribe can also tolerate the failure of multicast tree roots (rendez-vous points).
The state associated with the rendez-vous point, which identifies the topic
creator and has an access control list, is replicated across the k closest nodes
to the root node in the nodeId space (a typical value of k is 5). Note that
these nodes are in the leaf set of the root node. If the root fails, its
immediate children detect the failure and subscribe again through Pastry. Pastry
routes the subscriptions to a new root (the live node with the numerically
closest nodeId to the topicId), which takes over the role of the rendez-vous point.
Publishers likewise discover the new rendez-vous point by routing via Pastry. 

Children table entries are discarded unless they are periodically refreshed by 
an explicit message from the child, stating its continued interest in the topic. 

This tree repair mechanism scales well: fault detection is done by sending
messages to a small number of nodes, and recovery from faults is local; only a
small number of nodes, $O(\log_{2^b} N)$, is involved.



Providing Additional Guarantees. By default, Scribe provides reliable, ordered
delivery of events only if the TCP connections between the nodes in the multicast
tree do not break. For example, if some nodes in the multicast tree fail, Scribe
may fail to deliver events or may deliver them out of order.

Scribe provides a simple mechanism to allow applications to implement
stronger reliability guarantees. Applications can define the following upcall
methods, which are invoked by Scribe.

forwardHandler(msg) is invoked by Scribe before the node forwards an event,
msg, to its children in the multicast tree. The method can modify msg before it
is forwarded.

subscribeHandler(msg) is invoked by Scribe after a new child is added to one
of the node’s children tables. The argument is the SUBSCRIBE message.

faultHandler(msg) is invoked by Scribe when a node suspects that its parent
is faulty. The argument is the SUBSCRIBE message that is sent to repair the tree.
The method can modify msg to add additional information before it is sent.

For example, an application can implement ordered, reliable delivery of events 
by defining the upcalls as follows. The forwardHandler is defined such that the 
root assigns a sequence number to each event and such that recently published 
events are buffered by the root and by each node in the multicast tree. Events 
are retransmitted after the multicast tree is repaired. The faultHandler adds the 
last sequence number, n, delivered by the node to the subscribe message and 
the subscribeHandler retransmits buffered events with sequence numbers above 
n to the new child. To ensure reliable delivery, the events must be buffered for
an amount of time that exceeds the maximal time to repair the multicast tree
after a TCP connection breaks.
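
The scheme just described can be sketched as follows. The three method names mirror the upcalls in the text, but the message layout (a dict with `seq`, `event` and `last_seq` keys) is hypothetical, the buffer is bounded by size rather than by time for simplicity, and the sketch assumes the forward handler runs on every node that receives an event.

```python
from collections import OrderedDict


class ReliableTopic:
    """Per-node state for one topic: sequence numbers plus a replay buffer."""

    def __init__(self, is_root=False, buffer_size=1024):
        self.is_root = is_root
        self.next_seq = 0            # used by the root only
        self.last_delivered = -1     # highest sequence number seen so far
        self.buffer = OrderedDict()  # seq -> event, oldest first
        self.buffer_size = buffer_size  # assumption: size bound, not a time bound

    def forward_handler(self, msg):
        """The root stamps a sequence number; every node buffers the event."""
        if self.is_root:
            msg["seq"] = self.next_seq
            self.next_seq += 1
        self.buffer[msg["seq"]] = msg["event"]
        while len(self.buffer) > self.buffer_size:
            self.buffer.popitem(last=False)   # evict the oldest event
        self.last_delivered = max(self.last_delivered, msg["seq"])
        return msg

    def fault_handler(self, msg):
        """Attach the last delivered sequence number to the repair SUBSCRIBE."""
        msg["last_seq"] = self.last_delivered
        return msg

    def subscribe_handler(self, msg):
        """Return buffered events the new child has not seen yet, in order."""
        n = msg.get("last_seq", -1)
        return [(seq, ev) for seq, ev in self.buffer.items() if seq > n]
```

A new parent would retransmit the list returned by `subscribe_handler` to the child before forwarding fresh events.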

To tolerate root failures, the root needs to be replicated. For example, one 
could choose a set of replicas in the leaf set of the root and use an algorithm like 
Paxos to ensure strong consistency.

4 Related Work 

Like Scribe, Overcast and Narada implement multicast using a self- 
organizing overlay network, and they assume only unicast support from the 
underlying network layer. Overcast builds a source-rooted multicast tree using 
end-to-end bandwidth measurements to optimize bandwidth between the source 
and the various group members. Narada uses a two step process to build the 
multicast tree. First, it builds a mesh per group containing all the group mem- 
bers. Then, it constructs a spanning tree of the mesh for each source to multi- 
cast data. The mesh is dynamically optimized by performing end-to-end latency 
measurements and adding and removing links to reduce multicast latency. The 
mesh creation and maintenance algorithms assume that all group members know 
about each other and, therefore, do not scale to large groups. 

Scribe builds a multicast tree on top of a Pastry network, and relies on Pastry 
to optimize route locality based on a proximity metric (e.g. IP hops or latency). 
The main difference is that the Pastry network can scale to an extremely large 
number of nodes because the algorithms to build and maintain the network 
have space and time costs of $O(\log_{2^b} N)$. This enables support for extremely
large groups and sharing of the Pastry network by a large number of groups. 

The recent work on Bayeux is the most similar to Scribe. Bayeux is built
on top of a scalable peer-to-peer object location system called Tapestry
(which is similar to Pastry). Like Scribe, it supports multiple groups, and it
builds a multicast tree per group on top of Tapestry but this tree is built quite 
differently. Each request to join a group is routed by Tapestry all the way to 
the node acting as the root. Then, the root records the identity of the new 
member and uses Tapestry to route another message back to the new member. 
Every Tapestry node (or router) along this route records the identity of the new 
member. Requests to leave a group are handled in a similar way. 

Bayeux has two scalability problems when compared to Scribe. Firstly, it
requires nodes to maintain more group membership information. The root keeps
a list of all group members, the routers one hop away from the root each keep a
list containing on average a fraction 1/b of the members (where b is the base
used in Tapestry routing), and so on. Secondly, Bayeux generates more traffic
when handling group membership changes. In particular, all group management
traffic must go through the root. Bayeux proposes a multicast tree partitioning
mechanism to ameliorate these problems by splitting the root into several
replicas and partitioning members across them. But this only improves
scalability by a small constant factor.

In Scribe, the expected amount of group membership information kept by
each node is small, as the subscribers are distributed over the nodes. Addition- 
ally, group join and leave requests are handled locally. This allows Scribe to scale 
to extremely large groups and to deal with rapid changes in group membership 
efficiently. 

The mechanisms for fault resilience in Bayeux and Scribe are also very differ- 
ent. All the mechanisms for fault resilience proposed in Bayeux are sender-based 
whereas Scribe uses a receiver-based mechanism. In Bayeux, routers proactively 
duplicate outgoing packets across several paths or perform active probes to select 
alternative paths. Both these schemes have some disadvantages. The mechanisms 
that perform packet duplication consume additional bandwidth, and the mech- 
anisms that select alternative paths require replication and transfer of group 
membership information across different paths. Scribe relies on heartbeats sent 
by parents to their children in the multicast tree to detect faults, and children use 
Pastry to reroute to a different parent when a fault is detected. Additionally, 
Bayeux does not provide a mechanism to handle root failures whereas Scribe 
does. 

5 Conclusions 

We have presented Scribe, a large-scale and fully decentralized event notifica- 
tion system built on top of Pastry, a peer-to-peer object location and routing 
substrate overlaid on the Internet. Scribe is designed to scale to large numbers
of subscribers and topics, and supports multiple publishers per topic. 

Scribe leverages the scalability, locality, fault-resilience and self-organization 
properties of Pastry. Pastry is used to maintain topics and subscriptions, and to 
build efficient multicast trees. Scribe’s randomized placement of topics and mul- 
ticast roots balances the load among participating nodes. Furthermore, Pastry’s 
properties enable Scribe to exploit locality to build efficient multicast trees and 
to handle subscriptions in a decentralized manner. 

Fault-tolerance in Scribe is based on Pastry’s self-organizing properties. Scribe’s 
default reliability scheme ensures automatic adaptation of the multicast tree to 
node and network failures. Event dissemination is performed on a best-effort ba- 
sis; consistent ordering of delivered events is not guaranteed. However, stronger 
reliability models can be layered on top of Scribe. 

Simulation results, based on a realistic network topology model and presented 
in 123 , indicate that Scribe scales well. It efficiently supports a large number 
of nodes, topics, and a wide range of subscribers per topic. Hence, Scribe can 
concurrently support applications with widely different characteristics. Results 
also show that it balances the load among participating nodes, while achieving 
acceptable delay and link stress, when compared to network-level (IP) multicast.

Abstract. Gossip-based protocols have received considerable attention
for broadcast applications due to their attractive scalability and relia- 
bility properties. The reliability of probabilistic gossip schemes studied 
so far depends on each user having knowledge of the global membership 
and choosing gossip targets uniformly at random. The requirement of 
global knowledge is undesirable in large-scale distributed systems. 

In this paper, we present a novel peer-to-peer membership service which 
operates in a completely decentralized manner in that nobody has global 
knowledge of membership. However, membership information is repli- 
cated robustly enough to support gossip with high reliability. Our scheme 
is completely self-organizing in the sense that the size of local views natu- 
rally converges to the ‘right’ value for gossip to succeed. This ‘right’ value 
is a function of system size, but is achieved without any node having to 
know the system size. We present the design, theoretical analysis and 
preliminary evaluation of Scamp. Simulations show that its performance 
is comparable to that of previous schemes which use global knowledge of 
membership at each node. 

Keywords: Scalability, reliability, peer-to-peer, gossip-based probabilis- 
tic multicast, membership, group communication, random graphs. 



1 Introduction 

The demand for large-scale event dissemination in distributed systems is growing
rapidly, but traditional network-level protocols and broadcast algorithms do
not scale to more than thousands of participants. Techniques such as SRM
(Scalable Reliable Multicast Protocol) or RMTP (Reliable Multicast Transport
Protocol) have added reliability to network-level IP multicast solutions,
using acknowledgments and repair mechanisms. However, network-level multicast
approaches offer no support for membership tracking, and their applicability
is limited by the lack of wide deployment of IP multicast. As a result,
application-level multicast, and in particular, gossip-based broadcast algorithms,
have recently emerged as an attractive alternative. Probabilistic versions of these
have received much attention and provide good scalability and reliability
properties. Their scalability relies on a peer-to-peer interaction model, where
each participating node is in charge of a part of the dissemination process: the
first time a node receives each notification, it forwards it to a random subset of
other nodes (see the next section for details). The protocols incorporate
redundant messages which make them highly resilient to failures.

Though the above gossip-based approaches have proven scalable, they rely
on a non-scalable membership protocol: they assume that the subset of nodes
is chosen uniformly among all participating nodes, requiring that each node
should know every other node. This imposes high requirements on memory and
synchronization, which adversely affects their scalability. This has motivated
work on distributing membership management in order to provide each
node with a partial random view of the system without any node having global
knowledge of the membership.

Our notion of a scalable membership protocol should not be confused
with that of earlier work in which the aim is to provide each member of the
group with an accurate and timely global view of the membership. The problem
we consider is instead to provide each node with partial membership information
which is sufficient to achieve reliable dissemination using a traditional
gossip-based protocol. One approach to this issue constructs a connection graph
called a Harary graph. Optimality properties of Harary graphs
ensure a good trade-off between the number of messages propagated and the
reliability guarantees. However, building such a graph requires global
knowledge of membership, and maintaining such a graph structure in the presence
of arrivals and departures of nodes might prove difficult.

A protocol that does not rely on global knowledge of membership is Lpbcast.
However, the size of the partial view and the number of gossip targets are
fixed a priori, which precludes decentralized adaptation to changes in system
size.

We seek to provide a fully decentralized membership scheme, which meets 
the following goals: nodes obtain a partial view that adapts automatically to 
system size, and the view size is tuned to support gossip-based dissemination. 
In earlier work, we derived the fanout (number of gossip targets) required, as
a function of system size, in order to achieve reliability. When the membership 
management is centralized or distributed among a few servers, the number of 
participants is easily determined, and the fanout can be adjusted to match reli- 
ability requirements. However, in a fully decentralized model, where each node 
operates with an incomplete view of the system, this is no longer straightforward. 

We propose a novel probabilistic scalable membership protocol (Scamp)
aimed at addressing this problem. Scamp is very simple, fully decentralized and
self-configuring. As the number of participating nodes, n, increases, we show both
analytically and through simulation that the size of local views automatically
adapts to the desired value of (c + 1) log n. Here, c is a design parameter which
specifies the degree of robustness to failures: it follows from Theorem 1 of our
earlier analysis that any proportion of failed links up to c/(c + 1) can be
tolerated when the fanout is set to (c + 1) log n. Preliminary evaluation results
show that gossip based on the partial views provided by Scamp is as resilient
to failures as gossip based on random choice from a global membership known at
each node. Scamp can potentially be incorporated in existing gossip-based
schemes to reduce memory and synchronization overhead due to membership
management.

The remainder of the paper is organized as follows. We describe Scamp
in Section 2. The theoretical analysis is presented in Section 3 and simulation
results in Section 4. We conclude in the final section.

2 Scamp: Peer-to-Peer Lightweight Membership Service
for Large-Scale Group Communication

In this section we present the system model and the algorithms of Scamp. The 
scalability of the algorithm relies on its peer-to-peer communication model
between nodes for both membership management and gossip dissemination. We
have designed Scamp to achieve partial views of just the right size to be re- 
silient to a given fraction of failures. This presupposes that nodes gossip to all 
nodes in their partial view. They could choose to gossip to a randomly chosen 
subset instead at the cost of reducing the fraction of failures tolerated. 

In gossip-based protocols, notifications are propagated as follows. When a
node generates a notification, it sends it to a random subset of other nodes.
When any node receives a notification for the first time, it does the same. The
question is how large this random subset should be in order for all nodes
to receive the notification with high probability. In earlier work, we proved
the following result. If there are n nodes, and each node gossips to log n + c
other nodes on average, then the probability that everyone gets the notification
converges to $\exp(-e^{-c})$. In other words, there is a sharp threshold at log n: the
probability of success (everyone receiving the notification) is close to one if each
node gossips to slightly more than log n nodes and close to zero if each node
gossips to slightly fewer than log n nodes. We also derived expressions for how
the success probability depends on the failure rate of nodes and links.
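
To make the threshold concrete, the limiting success probability can be evaluated directly. The sketch below writes the average fanout as log n plus a constant `offset` (a name chosen here for illustration) and uses the double-exponential limit stated above; it says nothing about finite-n corrections or failures.

```python
import math


def gossip_success_limit(offset):
    """Limiting probability that all nodes are reached when the mean fanout
    is log(n) + offset, i.e. exp(-exp(-offset))."""
    return math.exp(-math.exp(-offset))


def fanout_for(n, target):
    """Mean fanout log(n) + offset whose limiting success probability equals
    `target` (0 < target < 1), obtained by inverting the formula above."""
    offset = -math.log(-math.log(target))
    return math.log(n) + offset
```

For instance, an offset of 0 gives a limit of about 0.37, while an offset of 5 already gives more than 0.99, which illustrates how sharp the transition around log n is.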

Previous work on gossip-based protocols has relied on each node having 
knowledge of the global membership list so that gossip targets can be chosen 
uniformly at random from all members. In earlier work, we proposed a scheme whereby
a set of servers maintains the global membership list and provides individual 
nodes with a randomized partial view. Thus, nodes don’t all need to have global 
information, but simply gossip to everyone in their local list. In the present work, 
we eliminate the need for servers and describe a fully decentralized scheme which 
achieves the same goals: nodes obtain a randomized partial view of the system, 
and the size of this view automatically scales correctly with system size, even 
though no node knows the system size. We now describe the details of this 
scheme. 



2.1 Membership Management in Scamp 

Subscription. New nodes join the group by sending a subscription request to an
arbitrary member. They start with a local view consisting of just the member
to whom they sent their subscription request. When a node receives a new
subscription request, it forwards the new node-id to all members of its own local
view. It also creates c additional copies of the new subscription (c is a design
parameter that determines the proportion of failures tolerated) and forwards them
to randomly chosen nodes in its local view. When a node receives a forwarded
subscription, it integrates the new subscriber in its view with a probability
p which depends on the size of its view. If it doesn’t keep the new subscriber,
it forwards the subscription to a node randomly chosen from its local view. The
system configures itself towards views of size (c + 1) log(n) on average, n being
the number of nodes in the system.

Algorithm 1 depicts the pseudo-code for a node receiving a new subscription.
Algorithm 2 depicts the pseudo-code for a node receiving a forwarded
subscription.



Algorithm 1. Subscription Management

Upon subscription(s) of a new subscriber
    {The subscription of s is forwarded to all the nodes of view}
    for (i=0; i < view.Count; i++) do
        {For each node n in view}
        Send(view[i], s, forwardedSubscription);
    end for
    {c additional copies of the subscription s are forwarded to random
     nodes of view}
    for (j=0; j < c; j++) do
        randomNode = RandomChoice(view.Count);
        Send(view[randomNode], s, forwardedSubscription);
    end for



Algorithm 2. Handling of a Forwarded Subscription

    {A node receiving a forwarded subscription adds it with the probability
     p = 1/(1 + sizeOf(view)) if it doesn’t have it already}
    {It forwards the subscription to a node randomly chosen in its list if
     it does not keep it}
    keep = RandomChoiceBetween0and1();
    keep = Math.Floor((view.Count + 1) * keep);
    if (keep == 0) and s ∉ view then
        view.Add(s);
    else
        int i = RandomChoice(view.Count);
        n = view[i];
        Send(n, s, forwardedSubscription);
    end if
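
The two algorithms can be exercised together in a small simulation. The sketch below is one possible reading of the pseudo-code, not the authors' implementation. Three points are assumptions added here to make it executable: a contact with an empty view keeps the subscription itself (bootstrap case), a node never adds itself to its own view, and a hop limit stops forwarding walks that cannot terminate.

```python
import math
import random


def forward_subscription(views, node, s, rng, max_hops=1000):
    """Algorithm 2: keep s with probability 1/(1 + |view|), else pass it on."""
    for _ in range(max_hops):          # assumption: hop limit bounds the walk
        view = views[node]
        keep = rng.random() < 1.0 / (1 + len(view))
        if keep and s != node and s not in view:   # s != node is an assumption
            view.append(s)
            return True
        if not view:
            return False
        node = rng.choice(view)        # forward to a random member of the view
    return False


def subscribe(views, contact, s, c, rng):
    """Algorithm 1: the contact forwards s to its view plus c random extras."""
    views[s] = [contact]               # the new node starts with its contact only
    view = list(views[contact])        # snapshot of the contact's view
    if not view:
        views[contact].append(s)       # assumption: bootstrap for a lone contact
        return
    targets = view + [rng.choice(view) for _ in range(c)]
    for node in targets:
        forward_subscription(views, node, s, rng)


def simulate(n, c=0, seed=1):
    """Grow a system to n nodes, each joining through a uniformly random member."""
    rng = random.Random(seed)
    views = {0: []}
    for s in range(1, n):
        subscribe(views, rng.randrange(s), s, c, rng)
    return views
```

Calling `simulate(2000)` and averaging `len(view)` over all nodes gives a mean view size that can be compared with the target (c + 1) log(n).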



Note that our membership protocol creates a distribution graph which en- 
sures that every node is connected. This implies that, in the absence of failures 
or unsubscriptions, the dissemination of messages is fully reliable. 



Unsubscriptions. Unsubscriptions are handled as a gossip message and are
disseminated to all members of the group. Any node that has the unsubscribing
node in its partial view deletes it on receiving the unsubscription message.

Recovery from Isolation. A node becomes isolated when its identifier is present
in no local views because, for example, all nodes holding its identifier have either
failed or unsubscribed. Such a node has a substantial probability of remaining
isolated for a long time. To overcome this problem, we propose a periodic check
mechanism performed by isolated nodes. A node which has not received messages
for a given period (the period is chosen to be much larger than the average time
between messages¹) will resubscribe through an arbitrary node in its partial
view.

3 Analysis 

We now present the theoretical analysis of the algorithm described above. We 
model the system as a random directed graph: nodes correspond to subscribers 
and there is a directed arc (x, y) whenever y is in the local list of x.²

When a new node subscribes, the action of our algorithm is to create a 
random number of additional arcs, as follows. Suppose there are n members 
already in the group. If the new node subscribes to a node with out-degree d, 
then d + c + 1 arcs are added. The new node has out-degree 1, with list consisting
of just the node it subscribed to. The node receiving the subscription forwards 
one copy of the node-id of the subscribing node to each of its neighbors, and an 
additional c copies to randomly chosen neighbors. These forwarded subscriptions 
may be kept by the neighbors or forwarded, but are not destroyed until some node 
keeps them. No node keeps multiple copies of the same subscription. In practice, 
each node chooses whether to keep a forwarded subscription with probability 
inversely proportional to the length of its current list. For ease of analysis, we’ll 
assume that new arcs are added by choosing nodes uniformly at random without 
replacement. 

Let $M_n$ denote the number of arcs when the number of nodes has grown to
$n$, so that the average out-degree of each node is $M_n/n$. We have

\[ E[M_{n+1} \mid M_n] = M_n + \frac{M_n}{n} + c + 1, \]

from which we find that $E[M_n] \approx (c+1)\, n \log n$. If in fact $M_n = (c+1)\, n \log n$,
and the arcs are distributed uniformly at random among the nodes, then it was
shown in earlier work (Theorem 1) that the probability of a gossip being
successful is very nearly 1 if the link failure probability is smaller than
$c/(c+1)$. We shall now bound the deviation of the random quantity $M_n$ from its
mean, and show that, with high probability, $M_n$ is very close to
$(c+1)\, n \log n$. In other words, the proposed membership management scheme
achieves the desired out-degree with high probability, with no centralized
control or even knowledge of the size of the group.

¹ To facilitate this, we ensure that heartbeat messages are sent if no message has been
sent within this period.

² Note that the graph represents the logical relation of membership in local views
rather than the physical topology of the underlying network. The validity of the
random graph model thus relies on the way in which the membership lists are created
and is not dependent on the graph structure of the physical network.
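
The growth of the expected number of arcs can be checked numerically. The sketch below iterates the recursion for the expected number of arcs (each join through a node of out-degree d adds d + c + 1 arcs, and the contact is uniformly random, so the expected increment is M_n/n + c + 1). The starting point of two nodes holding each other in their views is an assumption made here for concreteness.

```python
import math


def expected_arcs(n_max, c, m_start=2.0, n_start=2):
    """Iterate E[M_{n+1}] = (1 + 1/n) E[M_n] + c + 1 from n_start to n_max.

    m_start is the assumed number of arcs in the initial n_start-node system.
    """
    m = m_start
    for n in range(n_start, n_max):
        m = m * (1 + 1 / n) + c + 1
    return m


def ratio_to_target(n_max, c):
    """Mean out-degree divided by the target (c + 1) log(n_max)."""
    return expected_arcs(n_max, c) / n_max / ((c + 1) * math.log(n_max))
```

The ratio returned by `ratio_to_target` approaches 1 slowly, since the mean out-degree differs from (c + 1) log n only by a bounded additive term.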

Let $\mathcal{F}_n$ denote the $\sigma$-algebra corresponding to the sequence of random graphs
created after each of the first $n$ nodes joined the group. We shall show that

\[ X_n := \frac{M_n}{n} - \sum_{i=1}^{n} \frac{c+1}{i} \]

is a martingale. By the assumption that new nodes subscribe to a randomly
chosen member node, we have

\[ E[M_{n+1} \mid \mathcal{F}_n] = M_n + \frac{M_n}{n} + c + 1, \]

from which it follows that $E[X_{n+1} \mid \mathcal{F}_n] = X_n$, i.e., $X_n$ is a martingale. We now
estimate its variance.

Let $\pi_n$ denote the empirical distribution of node out-degrees conditional on
$\mathcal{F}_n$. The subscription goes to a random node whose out-degree, denoted $d_n$, is
a random draw from $\pi_n$. Now, $d_n + c$ copies of the subscription are forwarded,
and are eventually kept by nodes chosen uniformly at random (without
replacement³). Let $d_1, \ldots, d_{d_n+c}$ denote the out-degrees of these nodes. The new
empirical distribution is given by

\[ (n+1)\,\pi_{n+1} = n\,\pi_n + \delta_1 + \sum_{i=1}^{d_n+c} \left( \delta_{d_i+1} - \delta_{d_i} \right), \]

where $\delta_k$ denotes unit mass at $k$. Let $f_n$ and $v_n$ denote the expected mean
and second moment of $\pi_n$, which is a random probability distribution. Let
$h_n = E[d_i d_j]$ where $d_i$ and $d_j$ are the out-degrees of two distinct nodes
chosen uniformly at random. Let $w_n$ denote the expected second moment of the
total number of edges, $M_n$.

Observe that $M_{n+1} = M_n + 1 + (d_n + c)$, and so $(n+1) f_{n+1} = n f_n + 1 + f_n + c$,
i.e.,

\[ f_{n+1} = f_n + \frac{c+1}{n+1}. \qquad (2) \]

Moreover,

\[
\begin{aligned}
w_{n+1} &= w_n + E\left[ d_n^2 + 2(1+c)\, d_n + (1+c)^2 \right] + 2(1+c)\, E[M_n] + 2 E[d_n M_n] \\
&= w_n + v_n + 2(1+c) f_n + (1+c)^2 + 2(1+c)\, n f_n + 2 \sum_{i=1}^{n} E[d_n d_i] \\
&= w_n + 3 v_n + 2(n-1) h_n + 2(1+c)(n+1) f_n + (1+c)^2.
\end{aligned}
\]



³ In fact, our algorithm stores subscriptions preferentially in nodes with smaller
out-degree. Ignoring this increases the variance of out-degrees and so the conclusions
from the analysis presented here are expected to hold a fortiori for our algorithm.

We also have from the update equation for $\pi_{n+1}$ that

\[
\begin{aligned}
v_{n+1} &= \frac{1}{n+1} \left( 1 + n v_n + E\left[ \sum_{i=1}^{d_n+c} \left( (d_i+1)^2 - d_i^2 \right) \right] \right) \\
&= \frac{1}{n+1} \left( 1 + n v_n + 2 E[d_i d_n] + 2c\, E[d_n] + f_n + c \right) \\
&= \frac{n}{n+1}\, v_n + \frac{2}{n+1}\, h_n + \frac{2c+1}{n+1}\, f_n + \frac{c+1}{n+1}.
\end{aligned}
\]

We can eliminate $h_n$ from the two equations above using the fact that

\[ w_n = E[M_n^2] = E\left[ \left( \sum_{i=1}^{n} d_i \right)^2 \right] = n v_n + n(n-1)\, h_n, \]

from which it follows that

\[ h_n = \frac{w_n - n v_n}{n(n-1)}. \]

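Substituting this above and simplifying, we get the recursions:

\[ w_{n+1} = \left( 1 + \frac{2}{n} \right) w_n + v_n + 2(1+c)(n+1) f_n + (1+c)^2, \qquad (3) \]

\[ v_{n+1} = \frac{n-2}{n-1}\, v_n + \frac{2}{(n-1)n(n+1)}\, w_n + \frac{2c+1}{n+1}\, f_n + \frac{c+1}{n+1}. \qquad (4) \]

Let $\gamma_n = w_n - n^2 f_n^2$ denote the variance of $M_n$, and let $\eta_n = v_n - f_n^2$ denote
the expected variance of the random distribution $\pi_n$. From the above, we obtain
the following recursions for $\gamma_n$ and $\eta_n$:

\[ \gamma_{n+1} = \left( 1 + \frac{2}{n} \right) \gamma_n + \eta_n, \qquad (5) \]

\[ \eta_{n+1} = \frac{n-2}{n-1}\, \eta_n + \frac{2}{(n-1)n(n+1)}\, \gamma_n + \frac{1}{n+1} \left( f_n^2 - f_n \right) + \frac{c+1}{n+1} \left( 1 - \frac{c+1}{n+1} \right). \qquad (6) \]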

Iterating (5), we obtain the expression

\[ \gamma_n = n(n+1) \sum_{k=0}^{n-1} \frac{\eta_k}{(k+1)(k+2)}. \qquad (7) \]

We substitute this in (6) and use straightforward bounds to obtain the inequality

\[ \eta_{n+1} \le \frac{n-2}{n-1}\, \eta_n + \frac{2}{n-1} \sum_{k=0}^{n-1} \frac{\eta_k}{(k+1)(k+2)} + \kappa\, \frac{\log^2 n}{n}, \qquad (8) \]

valid for all $n \ge 2$, where $\kappa$ is a suitably chosen constant (the expression for $f_n$
entails for instance that $\kappa = 3(c+1)^2 + (c+1)/\log^2 2$ would suffice).

We now establish the following result.

Lemma 1. There exists a constant $R > 0$ such that, for all $n \ge 2$,

\[ \eta_n \le R \log^2 n. \qquad (9) \]

Proof: Assume that we have found a constant $R$ such that the desired inequality
is satisfied for all $k$ in the range $\{2, \ldots, n\}$. In view of (8), we obtain

\[ \eta_{n+1} \le R \log^2 n - \frac{R}{n-1} \log^2 n + \frac{2R}{n-1} \sum_{k=2}^{n-1} \frac{\log^2 k}{(k+1)(k+2)} + \frac{\eta_0 + \eta_1}{n-1} + \kappa\, \frac{\log^2 n}{n}. \]

Splitting the second term into two halves, and introducing the notation

\[ C = 2 \sum_{k \ge 2} \frac{\log^2 k}{(k+1)(k+2)}, \]

we obtain

\[ \eta_{n+1} \le R \log^2 n + (\kappa - R/2)\, \frac{\log^2 n}{n} + \frac{1}{n-1} \left( R \left( C - \frac{\log^2 n}{2} \right) + \eta_0 + \eta_1 \right). \]

From this last equation, we see that the induction hypothesis carries over to
$n + 1$, provided the inequalities $R \ge 2\kappa$ and $R(C - \log^2 n/2) + \eta_0 + \eta_1 \le 0$ hold.
Let $n_0$ be the smallest index $k \ge 2$ such that $\log^2 k/2 - C \ge \eta_0 + \eta_1$.

We are now ready to choose the constant $R$. A suitable choice will be

\[ R = \max\left( \max_{2 \le k \le n_0} \frac{\eta_k}{\log^2 k},\; 2\kappa,\; 1 \right). \]

Indeed, taking $R$ larger than $\max_{2 \le k \le n_0} (\eta_k/\log^2 k)$ ensures that the induction
hypothesis is satisfied in the range $k = 2, \ldots, n_0$. Taking it larger than $2\kappa$ ensures
that the first inequality we need to check in order to use induction is satisfied;
taking it larger than 1 ensures that, for $n \ge n_0$, the second inequality
$R(C - \log^2 n/2) + \eta_0 + \eta_1 \le 0$ is also satisfied, hence we can use induction from $n_0$
onwards.

Corollary 1. There exists a constant $R' > 0$ such that, for all $n \ge 1$,

\[ \gamma_n \le R' n^2. \qquad (10) \]

Proof: Combining (7) and (9) yields

\[ \gamma_n \le 2 n^2 \left[ \eta_0 + \eta_1 + R \sum_{k \ge 2} \frac{\log^2 k}{(k+1)(k+2)} \right], \]

from which the claim of the corollary follows if we choose $R' = 2(\eta_0 + \eta_1 + RC)$,
where the constant $C$ is as in the proof of the previous lemma.
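
The behaviour asserted by the lemma and the corollary can be observed by iterating the recursions numerically. The sketch below uses the recursions for the expected mean out-degree, the variance of the number of arcs, and the expected variance of the out-degree distribution as derived above. The starting state (two nodes, each with out-degree 1, hence zero variances) is an assumption made here, and the output is an illustration rather than a proof.

```python
import math


def iterate_moments(n_max, c):
    """Return (f, gamma, eta) at n = n_max, starting from an assumed
    two-node system in which both nodes have out-degree 1."""
    f, gamma, eta = 1.0, 0.0, 0.0          # values at n = 2
    for n in range(2, n_max):
        eta_next = ((n - 2) / (n - 1) * eta
                    + 2 * gamma / ((n - 1) * n * (n + 1))
                    + (f * f - f) / (n + 1)
                    + (c + 1) / (n + 1) * (1 - (c + 1) / (n + 1)))
        gamma_next = (1 + 2 / n) * gamma + eta   # uses eta at step n
        f += (c + 1) / (n + 1)
        gamma, eta = gamma_next, eta_next
    return f, gamma, eta


def normalized(n_max, c):
    """Return (gamma_n / n^2, eta_n / log^2 n), the quantities bounded above."""
    _, gamma, eta = iterate_moments(n_max, c)
    return gamma / n_max ** 2, eta / math.log(n_max) ** 2
```

Evaluating `normalized` for increasing values of n shows both ratios staying bounded, in line with the two bounds just proved.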

We now obtain from this corollary that $\mathrm{Var}(X_n) = \mathrm{Var}(M_n)/n^2 \le R'$ for
all $n$. As a consequence, the martingale $X_n$ is uniformly integrable, and by the
martingale convergence theorem, it converges almost surely to a finite random
variable $X_\infty$ as $n \to \infty$. In other words, the mean out-degree $M_n/n$ is close to
the target value of $(c+1) \log n$ in the precise sense that their difference converges
to a finite random variable (not growing with $n$) as $n \to \infty$. Thus, we finally
obtain from Theorem 1 of the earlier work that gossip in the resulting random
graph reaches all participants with high probability provided the proportion of
failed links is smaller than $c/(c+1)$.



4 Simulation Results 



In this section we present some preliminary simulation results which confirm 
the theoretical analysis and show the self-organizing property of Scamp as well 
as the good quality of the partial views generated. We first study the size of 
partial views and then provide some results comparing the resilience to failure 
of a gossip-based algorithm relying on Scamp for membership management with 
one relying on a global scheme. 



4.1 View Size 

The first objective of Scamp is to ensure that each node has a randomized 
partial view of the membership, of the right size to ensure successful gossip. 
All experiments in this section have been done with c = 0, i.e., the objective 
is to achieve an average view size of log(n). Recall that a fanout of this order
is required to ensure that gossip is successful with high probability. The key 
result we want to confirm here is that a fully decentralized scheme as in Scamp 
can provide each node with a partial view of size approximately log(n), without 
global membership information or synchronization between nodes. 

In Figure 1 we plot the average size of partial views achieved by Scamp
against system size. The figure shows that the average list size achieved by
Scamp matches the target value very closely, supporting our claim that Scamp
is self-organizing. Figure 2 shows the distribution of list sizes of individual nodes
in a 5000 node system. The distribution is unimodal with mode approximately
at log(n) (log(5000) = 8.51). While analytical results on the success probability
of gossip were derived in [?] for two specific list size distributions, namely the
deterministic and binomial distributions, we believe that the results are largely
insensitive to the actual degree distribution and depend primarily on the mean
degree.^ This is corroborated by simulations.



^ The claim has to be qualified somewhat as the following counterexample shows. If
the fanout is n with probability log n/n and zero with probability 1 − (log n/n), then
the mean fanout is log n but the success probability is close to zero. Barring such
extremely skewed distributions, we believe the claim to be true. An open problem
is to state and prove a suitable version of this claim.

Fig. 1. Relation between System Size and Average List Size Produced by Scamp.

Fig. 2. Histogram of List Sizes at Individual Nodes in a 5000 Node System
(x-axis: size of partial view).
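
The counterexample in the footnote above can be made explicit with a short
calculation. This is only a sketch: F is notation introduced here for the fanout
of a single node, and log is taken to be the natural logarithm, as in the
n = 5000 example.

```latex
\begin{align*}
E[F] &= n \cdot \frac{\log n}{n} + 0 \cdot \Bigl(1 - \frac{\log n}{n}\Bigr) = \log n, \\
P(\text{success}) &\le P(F_{\text{source}} > 0) = \frac{\log n}{n} \xrightarrow[n \to \infty]{} 0 .
\end{align*}
```

For n = 5000 the bound is roughly 8.5/5000, i.e. below 0.2%, although the mean
fanout equals the target value log n.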



4.2 Resilience to Failures 

One of the most attractive features of gossip-based multicast is its robustness to 
node and link failures. Event dissemination can meet stringent reliability guar- 
antees in the presence of failures, without any explicit recovery mechanism. This 
makes these protocols particularly attractive in highly dynamic environments 
where members can disconnect for non-negligible periods and then reconnect. 

We compare a gossip-based protocol relying on Scamp with one relying on 
global knowledge of membership in terms of their resilience to node failures. 
Figure 3 depicts the simulation results. We plot the fraction of surviving nodes
reached by a gossip message initiated from a random node as a function of
the number of failed nodes. Two observations are notable. First, the fraction
of nodes reached remains very high even when close to half the nodes have
failed, which confirms the remarkable fault-tolerance of gossip-based schemes.
Second, this fraction is almost as high using Scamp as using a scheme requiring
global knowledge of membership. This attests to the quality of the partial views 
provided by Scamp and demonstrates its viability as a membership scheme for 
supporting gossip. 
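
The global-membership baseline of this experiment can be sketched in a few lines
of Python. This is an illustrative model under stated assumptions, not the
simulator used for the figure: every node forwards a message once to `fanout`
targets drawn uniformly from the full membership (failed nodes included, since
senders do not know who has failed), and the names `gossip_reach` and
`mean_reach` are hypothetical.

```python
import random

def gossip_reach(n, fanout, n_failed, rng):
    """Fraction of surviving nodes reached by one gossip message.

    Assumption: global membership, i.e. each node picks `fanout` distinct
    targets uniformly among all n nodes; failed nodes never receive or forward.
    """
    failed = set(rng.sample(range(n), n_failed))
    alive = [i for i in range(n) if i not in failed]
    source = rng.choice(alive)
    reached = {source}
    frontier = [source]          # nodes that still have to forward the message
    while frontier:
        frontier.pop()           # this node now gossips once
        for target in rng.sample(range(n), fanout):
            if target in failed or target in reached:
                continue
            reached.add(target)
            frontier.append(target)
    return len(reached) / len(alive)

def mean_reach(n, fanout, n_failed, runs=20, seed=0):
    """Average of gossip_reach over several independent runs."""
    rng = random.Random(seed)
    return sum(gossip_reach(n, fanout, n_failed, rng) for _ in range(runs)) / runs

if __name__ == "__main__":
    for failed in (0, 250, 500, 750):
        print(failed, round(mean_reach(1000, 9, failed), 3))
```

Replacing the uniform target selection by the partial views built by Scamp would
give the other curve of the comparison; that part is not sketched here.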



[Figure: resilience to node failure in a 5000 node system; x-axis: number of
failures; curves: SCAMP, global membership knowledge with fanout=8, global
membership knowledge with fanout=9.]

Fig. 3. Resilience to Failure in a 5000 Node System.



5 Conclusion 

Reliable group communication is important in applications involving large-scale 
distributed systems. Probabilistic gossip-based protocols have proven to scale to 
a large number of nodes while providing attractive reliability properties. How- 
ever, most gossip-based protocols rely on nodes having global membership in- 
formation. For large groups, this consumes a lot of memory and generates a lot 
of network traffic due to the synchronization required to maintain global con- 
sistency. In order to use gossip-based algorithms in large-scale groups, which 
is their natural application domain, the membership protocol also needs to be 
decentralized and lightweight. 

In this paper, we have presented the design, theoretical analysis and evalu- 
ation of Scamp, a probabilistic peer-to-peer scalable membership protocol for 
gossip-based dissemination. Scamp is fully decentralized in the sense that each 
node maintains only a partial view of the system. It is also self-organizing: the 
size of partial views naturally increases with the number of subscriptions in or- 
der to ensure the same reliability guarantees as the group grows. Thus Scamp 
provides efficient support for large and highly dynamic groups. 

One of the key contributions of this paper is the theoretical analysis of 
Scamp, which establishes probabilistic guarantees on its performance. The anal- 
ysis, which is asymptotic, is confirmed by simulations, which show that a gossip- 
based protocol using Scamp as a membership service is almost as resilient to 
failures as a protocol relying on knowledge of global membership at each node.

Future work includes comparing Scamp with other membership protocols,
and modifying it to take geographical locality into account in the generation of
partial views.

Abstract. In multicast communication, it is often required that feedback
is received from a potentially very large group of responders while at
the same time a feedback implosion needs to be prevented. To this end, 
a number of feedback control mechanisms have been proposed, which 
rely either on tree-based feedback aggregation or timer-based feedback 
suppression. Usually, these mechanisms assume that it is not necessary 
to discriminate between feedback from different receivers. However, for 
many applications this is not the case and feedback from receivers with 
certain response values is preferred (e.g., highest loss or largest delay). 
In this paper, we present modifications to timer-based feedback suppres- 
sion mechanisms that introduce such a preference scheme to differentiate 
between receivers. The modifications preserve the desirable characteristic 
of reliably preventing a feedback implosion. 



1 Introduction 

Many multicast protocols require receiver feedback. For example, feedback can
be used for control and identification functionality for multicast transport
protocols [?] and for status reporting from receivers for congestion control [?].
In such scenarios, the size of the receiver set is potentially very large. Sessions
with several million participants may be common in the future and without an 
appropriate feedback control mechanism a severe feedback implosion is possible. 

Some multicast protocols arrange receivers in a tree hierarchy. This hierarchy 
can be used to aggregate receiver feedback at the inner nodes of the tree to 
effectively solve the feedback implosion problem. However, in many cases such a 
tree will not be available (e.g., for satellite links) or cannot be used for feedback 
aggregation (e.g., in networks without router support). For this reason, we will 
focus on feedback control using timer-based feedback suppression throughout 
the remainder of the paper. 

Pure end-to-end feedback suppression mechanisms do not need any addi- 
tional support except from the end-systems themselves and can thus be used for 
arbitrary settings. The basic mechanism of feedback suppression is to use ran- 
dom feedback timers at the receivers. Feedback is sent when the timer expires 
unless it is suppressed by a notification that another receiver (with a smaller 
timeout value for its feedback timer) already sent feedback.

Most of the mechanisms presented so far assume that there is no preference
as to which receivers send feedback. As we will see, for many applications this 
is not sufficient. Those applications require the feedback to reflect an extreme 
value for some parameter within the group. Multicast congestion control, for 
example, needs to get feedback from the receiver(s) experiencing the worst net- 
work conditions. Other examples are the polling of a large number of sensors for 
extreme values, online auctions where one is interested in the highest bids, and 
the detection of resource availability in very large distributed systems. 

In this paper we propose several algorithms that favor feedback from receivers 
with certain characteristics while preserving the feedback implosion avoidance 
of the original feedback mechanism. Our algorithms can therefore be used to 
report extrema from very large multicast groups. In particular, they have been 
implemented as feedback mechanisms for the TFMCC protocol [?].

Past work related to this paper is presented in section 2. In section 3 we
summarize basic properties of timer-based feedback algorithms and give some
definitions to be used in our analysis. Depending on the amount of knowledge
about the distribution of the values to be reported we distinguish extremum
detection and feedback bias. With the former we just detect extreme values
without forcing early responses from receivers with extreme values. This variant,
which requires no additional information about the distribution of the values,
is studied in section 4. With the latter we exploit knowledge about the value
distribution by biasing the timers of responders. Biased feedback is studied in
section 5. In both sections, we give a theoretical analysis of the properties of our
feedback mechanisms and present simulations that corroborate our findings. We
conclude the paper and give an outlook on future work in section 6.



2 Related Work 

Feedback suppression algorithms have already been widely studied and em- 
ployed. Good scalability to very large receiver sets can be achieved by expo- 
nentially distributing the receivers’ feedback times. A method of round-based 
polling of the receiver set with exponentially increasing response probabilities 
was first proposed in [2] to be used as a feedback control mechanism for multicast
video distribution. It was later refined by Nonnenmacher and Biersack [?],
using a single feedback round with exponentially distributed random timers at
the receivers. In [?] the authors compare the properties of different methods of
setting the timer parameters with exponential feedback and give analytical terms 
and simulation results for feedback latency and response duplicates. However, 
none of these papers consider preferential feedback. 

A simple scheme to gradually improve the values reported by the receivers 
is presented in [?]. Receivers continuously give feedback to control the sending
rate of a multicast transmission. Since the lowest rate of the previous round is 
known, feedback can be limited to receivers reporting this rate or a lower rate. 
It is necessary to further adjust the rate limit by the largest possible increase 
during one round to be able to react to improved network conditions. After
several rounds, the sending rate will reflect the smallest feedback value of the
receiver set. While not specifically addressed in the paper, this scheme could 
be used in combination with exponential feedback timers for suppression within 
the feedback rounds to reliably prevent a feedback implosion. However, with this 
scheme it may still take a number of rounds to obtain the optimum feedback 
value. 

Other algorithms not directly concerned with feedback suppression but with 
the detection of extremal values have been studied in the context of medium access
control and resource scheduling [?]. The station allowed to use a shared
resource is the one with the smallest contention parameter of all stations. A 
simple mechanism to determine this station is to use a window covering a subset 
of the possible contention parameters. Only stations with contention parameters 
within this window are allowed to respond and thus to compete for the re- 
source. Depending on whether no, one, or several stations respond, the window 
boundaries are adjusted until the window only contains the minimum contention 
parameter. In the above papers, strategies for optimally adjusting the window
with respect to available knowledge about the distribution of the contention
parameters are discussed.

To our knowledge, the only work that is directly concerned with altering a 
non-topology based feedback suppression mechanism to solicit responses from 
receivers with specific metric values is presented in [?]. The authors discuss two
different mechanisms: Targeted Slotting and Damping (TSD) and Targeted Iterative
Probabilistic Polling (TIPP). For TSD, response values are divided into
classes and the feedback mechanism is adjusted such that response times for the 
classes do not overlap. Responders within a better class always get to respond 
earlier than lower-class responders. Thus, the delay before feedback is received 
increases linearly with the number of empty high classes. Furthermore, it is not 
possible to obtain real values as feedback without the assignment of classes. To 
prevent implosion when many receivers fall into the same class, the response in- 
terval of a single class is divided into subintervals and the receivers are randomly 
spread over these intervals. It was shown in [?] that a uniform distribution of
response times scales very poorly to large receiver sets. TIPP provides better
scalability by using a polling mechanism based on the scheme presented in [?],
thus having more favorable characteristics than uniform feedback timers. How- 
ever, separate feedback rounds are still used for each possible feedback class. 
This results in very long feedback delays when the number of receivers is overes- 
timated and the number of feedback classes is large. Underestimation will lead 
to a feedback implosion. As a solution, the authors propose estimating the size 
of the receiver set before starting the actual feedback mechanism. Determining 
the size of the receiver set requires one or more feedback rounds. In contrast, 
the mechanisms discussed in this paper only require a very rough upper bound 
on the number of receivers and will result in (close to) optimal feedback values 
within a single round. A further assumption for TSD and TIPP is that the dis- 
tribution of the response values is known by the receivers. In most real scenarios 
this distribution is at best partially known or even completely unknown. If,
however, the distribution is known, a feedback mechanism that guarantees optimum
response values and at the same time prevents a feedback implosion can be built.
Such a mechanism is presented in section 5.



3 General Considerations 

Let us first summarize some previous work [?] on feedback control on which
we will later base our analysis. For feedback suppression with exponentially 
distributed timers, each receiver gives feedback according to the following mech- 
anism: 

Algorithm 1. (Exponential Feedback Suppression): 

Let N be an estimated upper bound on the number of potential responders^ and
T an upper bound on the amount of time by which the sending of the feedback 
can be delayed in order to avoid feedback implosion. 

Upon receipt of a feedback request each receiver draws a random variable x 
uniformly distributed in (0, 1] and sets its feedback timer to 

$$t = T \max(0,\; 1 + \log_N x) \qquad (1)$$

When a receiver is notified that another receiver already gave feedback, it 
cancels its timer. If the feedback timer expires without the receiver having received 
such a notification, the receiver sends the feedback message. 

Time is divided into feedback rounds, which are either implicitly or explicitly 
indicated to the receivers. In case continuous feedback is required, a new feedback 
round is started at the end of the previous one (i.e., after the first receiver gave 
feedback).
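
A single round of Algorithm 1 can be simulated with a few lines of Python. The
sketch below assumes multicast feedback with one uniform latency `tau` between
any two receivers, so every receiver whose timer expires less than `tau` after
the first feedback still sends. The function names and the parameter values in
the usage lines are illustrative choices.

```python
import math
import random

def feedback_timer(N, T, rng):
    """Timer of Equation 1: t = T * max(0, 1 + log_N x), x uniform in (0, 1]."""
    x = 1.0 - rng.random()      # random() lies in [0, 1), so x lies in (0, 1]
    return T * max(0.0, 1.0 + math.log(x, N))

def simulate_round(n, N, T, tau, rng):
    """Return (time of first feedback, number of feedback messages sent).

    Assumption: a notification reaches all other receivers exactly `tau`
    after the first feedback was sent; later timers are cancelled.
    """
    timers = sorted(feedback_timer(N, T, rng) for _ in range(n))
    first = timers[0]
    sent = sum(1 for t in timers if t <= first or t < first + tau)
    return first, sent

if __name__ == "__main__":
    rng = random.Random(1)
    rounds = [simulate_round(1000, 100_000, 4.0, 1.0, rng) for _ in range(200)]
    print("mean delay   :", sum(f for f, _ in rounds) / len(rounds))
    print("mean messages:", sum(m for _, m in rounds) / len(rounds))
```

Varying n while keeping N fixed shows how the delay and the number of messages
react to over- and underestimation of the receiver set.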

Extending the suggestions in [?], this algorithm sets the parameter of the
exponential distribution to its optimal value λ = ln N and additionally introduces
an offset of N^{-1} at t = 0 into the distribution that further improves the
feedback latency.
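
The rate ln N and the offset at t = 0 can be read off directly from the timer
rule of Algorithm 1. The following short derivation is a sketch that uses only
the definition of t and the uniform distribution of x on (0, 1]:

```latex
\begin{align*}
P(t = 0) &= P\bigl(x \le N^{-1}\bigr) = N^{-1}, \\
P(t \le s) &= P\bigl(1 + \log_N x \le s/T\bigr) = P\bigl(x \le N^{s/T - 1}\bigr)
            = N^{-1} e^{(\ln N)\, s/T}, \qquad 0 \le s \le T .
\end{align*}
```

The distribution function thus starts with a point mass of N^{-1} at zero and
grows exponentially in s/T with rate ln N until it reaches 1 at s = T.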

The choice of input parameters is critical for the functioning of the mecha- 
nism. While the mechanism is relatively insensitive to overestimation of the size 
of the receiver set, underestimation will result in a feedback implosion. Thus, a 
sufficiently large value for N should be chosen. Similarly, the maximum feedback 
delay T should be significantly larger than the network latency^ τ among the
receivers since for T ≈ τ a feedback implosion is inevitable.

^ The set of potential responders is formed by the participants that simultaneously 
want to give feedback. If no direct estimate is possible, N can be set to an upper 
bound on the size of the entire receiver set. 

^ With network latency we denote the average time between the sending of a feedback 
response by any one of the receivers and the receipt (of a notification) of this response 
by other receivers.

The expected delay until the first feedback is sent is approximately

$$T\,(1 - \log_N n) \qquad (2)$$

and the expected number of feedback messages sent is

E[M]=N-/^(^^+(^ 





( 3 ) 



where n is the actual number of receivers. From Equation 3 we learn that E[M]
remains fairly constant over a large range of n (as long as n < N).

A derivation of Equations 2 and 3 can be found in [?] and [?] respectively.

3.1 Unicast vs. Multicast Feedback Channels 

When receivers are able to multicast packets to all other receivers, feedback 
cancelation is immediate in that the feedback that ends the feedback round is 
received by other receivers at roughly the same time as by the sender. 

However, the mechanism described in the previous section also works in en- 
vironments where only the sender has multicast capabilities, such as in many 
satellite networks or networks where source-specific multicast [?] is deployed. In
that case, feedback is first unicast back to the sender which then multicasts a 
feedback cancelation message to all receivers. This incurs an additional delay of 
half a round-trip time, thus roughly doubling the feedback latency of the sys- 
tem (in the case of symmetric transmission delays between the sender and the 
receivers and amongst the receivers themselves.) 

In order to safeguard against loss of feedback cancelation messages with 
unicast feedback channels, we note that it may be necessary to let the sender 
send multiple cancelation messages in case multiple responses arrive at the sender 
and/or to repeat the previous cancelation message after a certain time interval. 
Loss of cancelation messages is critical since a delayed feedback cancelation is 
very likely to provoke a feedback implosion. 

3.2 Message Piggybacking 

The feedback requests and the cancelation messages from the sender can both 
be piggybacked on data packets to minimize network overhead. In case a unicast 
feedback channel is used, piggybacking has to be done with great care since at 
low sending rates the delayed cancelation messages may provoke a feedback im- 
plosion. This undesired behavior is likely to occur when the inter-packet spacing 
between data packets gets close to the maximum feedback delay. 

The problem can be prevented by not piggybacking but sending a separate 
cancelation message at low data rates (i.e., introducing an upper bound on the 
amount of time by which a cancelation message can be delayed). If separate 
cancelation messages are undesirable, it is necessary to increase the maximum
feedback delay T in proportion to the time interval between data packets.

3.3 Removing Latency Bias

Plain exponential feedback favors low-latency receivers since they get the feed- 
back request earlier and are thus more likely to suppress other feedback. In case 
the receivers know their own latency τ as well as an upper bound on the latency
for all receivers τ_max, it is possible to remove this bias. Receivers simply schedule
the sending of the feedback message for time t + (τ_max − τ) instead of t.
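
As a minimal sketch of this correction, the helper below (a hypothetical name)
shifts a timer value by τ_max − τ. It assumes that τ is the delay with which the
feedback request reaches the receiver, so that the absolute sending time becomes
t + τ_max for every receiver.

```python
def unbiased_send_time(t, tau, tau_max):
    """Shift timer t by (tau_max - tau) to cancel the low-latency advantage.

    Assumption: tau is this receiver's latency and tau_max an upper bound
    that holds for all receivers.
    """
    if not 0.0 <= tau <= tau_max:
        raise ValueError("tau must lie in [0, tau_max]")
    return t + (tau_max - tau)

if __name__ == "__main__":
    # Two receivers draw the same timer value but see the request at
    # different times; after the shift both send at the same absolute time.
    for tau in (0.1, 0.4):
        print(tau + unbiased_send_time(1.0, tau, 0.5))
```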

In fact, this unbiasing itself introduces a slight bias against low-latency re- 
ceivers in case unicast feedback channels are used. While the first feedback mes- 
sage is unaffected, subsequent duplicates are more likely to come from high- 
latency receivers, since they will receive the feedback suppression notification 
from the sender later in time. 

If it is not necessary to remove the latency bias, the additional receiver het- 
erogeneity generally improves the suppression characteristics of the feedback 
mechanism, as demonstrated in 0. Similar considerations hold for the suppres- 
sion mechanisms discussed in the following sections. 

4 Extremum Detection 

Let us now consider the case where not only an arbitrary response from the 
group is required but an extreme value for some parameter from within a group. 
Depending on the purpose the required extremum can be either a maximum 
or a minimum. Without loss of generality we will formulate all algorithms as 
maximum detection algorithms. 

4.1 Basic Extremum Detection 

An obvious approach to introduce a feedback preference scheme is to extend the 
normal exponential feedback mechanism with the following algorithm: 

Algorithm 2. (Basic Extremum Detection): 

Let v_1 > v_2 > ··· > v_k > 0 be the set of response values of the receivers.

Upon receipt of a feedback request each receiver sets a feedback timer according
to Algorithm 1. When a receiver with value v is notified that another receiver
already gave feedback with v' > v, it cancels its timer. Otherwise, when the
feedback timer expires (i.e., for all previous notifications v' < v or no notifications
were received at all), the receiver sends a feedback message with value v.

With this mechanism the sender will always obtain feedback from the receiver 
with the largest response value within one feedback round. 
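
The behaviour of Algorithm 2 can be illustrated with a small simulation. The
sketch below makes several assumptions that are not part of the algorithm
itself: feedback is multicast with a uniform latency `tau`, an equal value is
treated as suppressing (ties are irrelevant for real-valued feedback), and the
function names are hypothetical.

```python
import math
import random

def feedback_timer(N, T, rng):
    """Timer of Algorithm 1: t = T * max(0, 1 + log_N x), x uniform in (0, 1]."""
    x = 1.0 - rng.random()      # random() lies in [0, 1), so x lies in (0, 1]
    return T * max(0.0, 1.0 + math.log(x, N))

def extremum_round(values, N, T, tau, rng):
    """Return the (send_time, value) feedback messages of one round."""
    schedule = sorted((feedback_timer(N, T, rng), v) for v in values)
    sent = []
    for t, v in schedule:
        # A message sent at time ts is known to all receivers at ts + tau.
        heard = [vs for ts, vs in sent if ts + tau <= t]
        if any(vs >= v for vs in heard):
            continue            # suppressed by an equal or larger value
        sent.append((t, v))
    return sent

if __name__ == "__main__":
    rng = random.Random(7)
    values = [rng.random() for _ in range(1000)]
    sent = extremum_round(values, 100_000, 4.0, 1.0, rng)
    print(len(sent), "messages, best value", max(v for _, v in sent))
```

Since no other response can suppress the largest value, the maximum is always
among the messages sent, which is the property stated above.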

Let us now analyze the algorithm in detail: Following Equation 3 we use n
for the actual number of potential responders and denote the expected number of
feedback messages in Algorithm 1 with R(n) := E[M]. Let p_i be the fraction of
responders with value v_i. For k = 1 the problem reduces to Algorithm 1 and we
expect R(n) feedback messages. For k = 2 we can reduce the problem to the previous
case by assuming that every v_1 responder responds with both a v_1 and a v_2
message. Hereby, we can treat both groups independently from each other while
preserving the fact that v_1 responders also stop further (unnecessary) responses
from v_2 responders. Summing up both expected values we have R(p_1 n) + R(n)
messages. However, p_1 of the v_2 messages were sent by v_1 responders and are
thus duplicates. Subtracting these duplicates we obtain R(p_1 n) + p_2 R(n) for the
expected number of responses.

This argument can be extended to the general case

$$E[M] = R(p_1 n) + \frac{p_2}{p_1 + p_2}\,R(p_1 n + p_2 n) + \frac{p_3}{p_1 + p_2 + p_3}\,R(p_1 n + p_2 n + p_3 n) + \cdots + p_k R(n) \qquad (4)$$

where P_i := p_1 + p_2 + ··· + p_i and thus P_k = 1. According to [?], R(n) remains
approximately constant over wide ranges of n. Assuming R(n) ≈ R, p_i ≈ 1/k, and
k ≫ 1 we have

$$E[M] \approx \Bigl(1 + \frac{1}{2} + \frac{1}{3} + \cdots + \frac{1}{k}\Bigr) R \approx (\ln k + C)\,R \qquad (5)$$

where C = 0.577... denotes the Euler constant.

From this analysis we see that the number of possible feedback values has an
impact on the expected number of feedback messages. For a responder set with
a real-valued feedback parameter this results in E[M] ≈ ln(n) R.
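
The step from the general case to the harmonic sum can be spelled out as
follows. This is a sketch under the stated assumptions p_i = 1/k and R(n) ≈ R,
where H_k is notation introduced here for the k-th harmonic number:

```latex
\begin{align*}
p_i = \frac{1}{k} \;\Rightarrow\; P_i = \frac{i}{k}, \qquad
\frac{p_i}{P_i} = \frac{1}{i}, \qquad
E[M] \approx R \sum_{i=1}^{k} \frac{1}{i} = H_k R \approx (\ln k + C)\,R .
\end{align*}
```

For k = 10, for instance, H_k ≈ 2.93 while ln k + C ≈ 2.88, so the logarithmic
approximation is already close for a moderate number of values.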



4.2 Class-Based Extremum Detection 

Although this logarithmic increase is well acceptable for a number of applica- 
tions, the algorithm’s properties can be further improved by the introduction of 
feedback classes. Within those classes no differentiation is made between differ- 
ent feedback values. It is not necessary to choose a fixed size for all classes. The 
class size can be adapted to the required granularity for certain value ranges. 
In case a fixed number of classes is used, the expected number of feedback
messages increases only by a constant factor over normal exponential feedback. As
expected, this increase is observed in the simulation results shown in Figure 1. As
the number of classes approaches the number of receivers, the increase in feedback
messages follows more and more the logarithmic increase for real-valued
feedback as stated in Equation 5. For all simulations in this paper we use the
parameters N = 100,000 and T = 4τ and average the results over 200 simulation
runs, unless stated otherwise.

By adjusting the classes’ positions depending on the actual value distribution,
the number of classes required to cover the range of possible feedback values can
be reduced without increasing the intervals’ actual size. Thereby, the granularity
of the feedback suppression (i.e., to what extent less optimal values can suppress
better values) remains unchanged while the number of feedback messages is
reduced.

Fig. 1. Uniformly Sized Classes.

Fig. 2. Class-Based Suppression with Variable Class Position (three panels, each
plotted over the feedback value; legend: sent feedback, suppressed feedback,
suppressed values).

Figure 2 gives a schematic overview of this mechanism. The first diagram
shows the classless version of the feedback algorithm. Here, each time a feedback
message v_i is sent, the range of suppressed values increases to [0; v_i]. A total of
four feedback messages is sent in this example. The second diagram shows the
same distribution of feedback for the case of static classes. We assume equally
sized classes of size δ and v_1 ∈ [0; δ] for this example. After receipt of the
first feedback message v_1, the entire range [0; δ] of the lowest feedback class
is suppressed. Only when a value outside this class is to be reported is another
message sent, resulting in three feedback messages in total. The third diagram
shows the case of dynamically adjusted classes. Upon receipt of the first feedback
message v_1 the suppression limit is immediately raised to v_1 + δ and thus the
value range [0; v_1 + δ] is now being suppressed. Through this mechanism feedback
is reduced to only two messages.

With the above considerations, an elegant way to introduce feedback classes
is the modification of Algorithm 2 to suppress feedback not only upon receipt
of values strictly larger than the own value v but also upon receiving values
v' > (1 − q)v. This results in an adaptive feedback granularity dependent on the
absolute value of the optimum.

Algorithm 3. (Adaptive Class-Based Extremum Detection) : 

Let q be a tolerance factor with q ∈ [0; 1]. Modify Algorithm 2 such that a responder
with value v cancels its timer if another responder has already sent feedback
for value v' with v' > (1 − q)v.

For q = 0 the algorithm is equivalent to Algorithm 2 whereas for q = 1 we
obtain Algorithm 1.
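
The suppression rule of Algorithm 3 is easy to trace on a hand-made example.
The sketch below takes the timer values as given instead of drawing them at
random, treats a value exactly on the threshold as suppressing (an assumption),
and uses the hypothetical name `adaptive_round`.

```python
def adaptive_round(timed_values, q, tau=0.0):
    """Return the (time, value) messages sent under tolerance factor q.

    timed_values: list of (timer, value) pairs with positive values.
    tau: assumed uniform notification latency between receivers.
    """
    sent = []
    for t, v in sorted(timed_values):
        threshold = (1.0 - q) * v
        # Suppress if an already known response lies within the tolerance.
        if any(ts + tau <= t and vs >= threshold for ts, vs in sent):
            continue
        sent.append((t, v))
    return sent

if __name__ == "__main__":
    responders = [(0.1, 5.0), (0.2, 5.4), (0.3, 9.0), (0.4, 9.5)]
    for q in (0.0, 0.1, 1.0):
        print(q, [v for _, v in adaptive_round(responders, q)])
```

With q = 0.1 the responder with value 9.5 stays silent because 9.0 is already
within 10% of its own value, so the best reported value deviates from the true
maximum by less than the tolerance. With q = 0 all four responders send, and
with q = 1 only the first one does.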





Fig. 3. Number of Feedback Messages for Maximum Search (left) and Minimum 
Search (right). 



Assuming the values v_i to be evenly distributed between r v_max and v_max
(0 < r < 1) we have approximately k feedback classes^, where k ≤ ln r / ln(1 − q).
For a value range 0 < v < 1 we can assume k ≤ −ln n / ln(1 − q), setting r inversely
proportional to the number of receivers since the receiver set is too small to
cover the whole range of possible values.

Approximating further with

$$p_i = \frac{(1-q)^{i-1} - (1-q)^{i}}{1-r} = q\,\frac{(1-q)^{i-1}}{1-r}$$

and

$$P_i = \frac{1 - (1-q)^{i}}{1-r}$$

we have

$$E_{\max}[M] \le q R \sum_{i=1}^{k} \frac{(1-q)^{i-1}}{1 - (1-q)^{i}} \qquad (6)$$

^ We assume the parameter range (r, 1) to be fully covered by the feedback classes
which is not strictly the case for this algorithm. This approximation thus
overestimates the expected number of feedback messages.

The mechanism strongly benefits from the feedback classes being wider near 
the maximum and so holding more values than the classes near the minimum. 
As a consequence, the expected number of feedback messages is much lower 
compared to that of the previous algorithm. Note that for small r, the number
of members with v < (1 − q)^{-1} r can be very small. Eventually, these feedback
classes will contain only a single member and we therefore lose the desired
suppression effect that leads to a sub-logarithmic increase of feedback messages.
In maximum search this effect cannot be observed since already a single response 
in the larger feedback classes near the maximum will suppress all feedback from 
the potentially large number of small classes. In fact, this characteristic is not 
specific to maximum and minimum search but rather depends on the classes 
being large or small near the optimum. 

To demonstrate the effect we will calculate the expected number of feedback 
messages for a minimum search scenario: The feedback values Vi are again evenly 
distributed between r v_max and v_max, but in contrast to Algorithm 3 a responder
cancels its timer if a response with < (1 — q)v is received. The algorithm 
produces the minimal value of the group within a factor of 

The feedback classes are in the opposite order as compared to our previous
calculation:

$$p_i = \frac{(1-q)^{k-i} - (1-q)^{k-i+1}}{1-r} = q\,\frac{(1-q)^{k-i}}{1-r}$$

and

$$P_i = \frac{(1-q)^{k-i} - (1-q)^{k}}{1-r}$$

Thus

$$E_{\min}[M] \le q R \sum_{i=1}^{k} \frac{1}{1 - (1-q)^{i}} \approx (1-q)\,E_{\max}[M] + k q R \qquad (7)$$

Hence, for small r (large k) the sum is significantly larger than in the previous
case.

Both scenarios have been simulated with various values for q. The
sub-logarithmic increase of feedback messages can be seen in both plots shown in
Figure 3. But only with maximum search, where the feedback classes near the
search goal are wider, does the strong class-induced suppression dominate the
ln(n) scale effect.

^ As far as the expected number of feedback messages is concerned, this mechanism 
is equivalent to a maximum search with small class sizes for classes close to the 
optimum.

Numeric values for the upper limits on the expected number of feedback
messages in both scenarios can be obtained from Equations 6 and 7. Some
example values are shown in Table 1. These limits match well with the results
of our simulations.



Table 1. Upper Limits (as Factor of R) for the Expected Number of Feedback
Messages (r = 10^-2).

q                 0.05   0.10   0.25   0.50   1.00
Maximum search    3.64   3.00   2.19   1.59   1.00
Minimum search    7.91   7.00   5.64   3.80   1.00
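
The two upper limits are simple finite sums and can be evaluated numerically.
The sketch below rests on two assumptions: the number of classes is taken as
k = floor(ln r / ln(1 − q)), and r = 0.01. Both were chosen here because they
lead to values close to the tabulated ones; the function names are hypothetical.

```python
import math

def class_count(q, r):
    """Number of feedback classes; assumes k = floor(ln r / ln(1 - q))."""
    if q >= 1.0:
        return 1                # a single class covers the whole value range
    return max(1, math.floor(math.log(r) / math.log(1.0 - q)))

def max_search_limit(q, r):
    """Upper limit for maximum search, as a factor of R."""
    k = class_count(q, r)
    return sum(q * (1.0 - q) ** (i - 1) / (1.0 - (1.0 - q) ** i)
               for i in range(1, k + 1))

def min_search_limit(q, r):
    """Upper limit for minimum search, as a factor of R."""
    k = class_count(q, r)
    return sum(q / (1.0 - (1.0 - q) ** i) for i in range(1, k + 1))

if __name__ == "__main__":
    for q in (0.05, 0.10, 0.25, 0.50, 1.00):
        print(f"q={q:.2f}  max={max_search_limit(q, 0.01):.2f}"
              f"  min={min_search_limit(q, 0.01):.2f}")
```

The minimum-search sum exceeds the maximum-search sum by roughly kq, which is
why the gap between the two rows widens as q becomes small.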



As mentioned before, Algorithm 3 guarantees a maximum deviation from the
true optimum of a factor of q. It is worthwhile to note that this factor really
is an upper bound on the deviation. Almost always the reported values will
be much closer to optimal since the sender can choose the best one of all the
responses given. The deviation of the best reported value from the optimum for
different tolerance factors q is depicted in Figure 4. On average, with normal
exponential suppression (i.e., q = 100%) the best reported value lies within 10%
of the optimum, for q = 50% the deviation drops to less than 0.15%, for q = 10%
we obtain less than 0.02% deviation, etc. Thus, even for relatively high q with
consequently only a moderate increase in the number of feedback messages, the
best feedback values have only a marginal deviation from the optimum.

Fig. 4. Feedback Quality with Different Tolerance Values.

5 Biased Feedback

The previously described algorithms yield considerable results for various cases of 
extremum detection. However, they will not affect the expected value of the first 
feedback message but only improve the expected values of subsequent messages. 
In certain cases, the algorithms can be further improved by biasing the feedback 
timers. Increasing the probability that t_1 < t_2 if v_1 > v_2 results in better feedback
behaviour but we must carefully avoid a feedback implosion for cases where many
large values are present in the responder group. Without loss of generality we
assume v ∈ [0, 1] for the remainder of this section.

If the probability distribution of the values v is known, the number of re- 
sponses can be minimized using the following algorithm: 

Algorithm 4. (Deterministic Feedback Suppression): 

Let $P(v) = P(v' < v)$ be the probability distribution function of the values v 
within the group of responders. We follow the suppression algorithm described 
earlier, but instead of drawing a random number we set the feedback time directly to 

$$t = T \max\bigl(0,\; 1 + \log_N(1 - P(v))\bigr)$$

Clearly, duplicate feedback responses are now only due to network latency ef- 
fects since the responder with the maximum feedback value is guaranteed to have 
the earliest response time. However, the feedback latency is strongly coupled to 
the actual set of feedback values. Moreover, if the probability distribution of this 
specific set does not match the distribution used in the algorithm’s calculation, 
feedback implosion is inevitable. For this reason, Algorithm 4 should only be 
used if the distribution of feedback values is well known for each individual set 
of values. 
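As a minimal sketch of the timer rule in Algorithm 4, the following Python function maps a feedback value to its response time. The function name, the `cdf` callable and the uniform example distribution are assumptions made for illustration; N is taken to be the base of the logarithm, as in the formula above.

```python
import math


def deterministic_feedback_time(v, cdf, T, N):
    """Return t = T * max(0, 1 + log_N(1 - P(v))) for a feedback value v.

    cdf -- callable giving P(v) = P(v' < v); assumed to be known exactly
    T   -- maximum feedback delay
    N   -- base of the logarithm (assumed upper bound on the group size)
    """
    tail = 1.0 - cdf(v)  # probability that some other value exceeds v
    if tail <= 0.0:
        # v is the largest value the distribution allows: respond at once.
        return 0.0
    return T * max(0.0, 1.0 + math.log(tail, N))


def uniform_cdf(v):
    # Illustrative assumption: values uniformly distributed in [0, 1].
    return v
```

Under the assumed uniform distribution with N = 10,000 and T = 4, a value of 0.99 is reported at roughly t = 2, the largest possible value is reported immediately, and the timer never increases as v grows.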

The latter condition is crucial. In general, it does not hold for values from 
causally connected responders. Consider for example the loss rate for multicast 
receivers: If congestion occurs near the sending site, all receivers will experience a 
high packet loss rate simultaneously. Since the time-average distribution does not 
show this coherence effect the algorithm presented above will produce feedback 
implosion, if used to solicit responses from high-loss receivers. Due to this effect, 
the application of this simple mechanism is quite limited. It can be used, for 
example, with application level values where no coherence is generated within 
the network. 

A simple way to adopt the key idea of value-based feedback bias is to mix 
value-based response times with a random component. This mechanism can be 
applied in various cases where coherence effects prohibit the application of 
Algorithm 4. Let us study an example: 

Algorithm 5. (Feedback with Combined Bias): 

Apply the suppression algorithm described earlier, but modify the feedback time to 

$$t = T \max\bigl(0,\; (1 - v) + v\,(1 + \log_N x)\bigr) = T \max\bigl(0,\; 1 + \log_N x^{v}\bigr) \qquad (8)$$
Here, the feedback time consists of a component linearly dependent on the 
feedback value and a component for the exponential feedback suppression. The 
feedback time t is increased in proportion to decreasing feedback values v and 
a smaller fraction of T is used for the actual suppression. As long as at least 
one responder has a sufficiently early feedback time to suppress the majority of 
other feedback this distribution of timer values greatly decreases the number of 
duplicate responses while at the same time increasing the quality of the feedback 
(i.e., the best reported value with respect to the actual optimum value of the 
receiver set). Furthermore, in contrast to pure extremum detection algorithms 
this mechanism improves the expected feedback value of the first response as 
well as subsequent responses. 
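The two forms of Equation 8 can be checked numerically. The sketch below assumes, as in exponential suppression, that x is a random number drawn uniformly from (0, 1] and that N is the base of the logarithm; the function names are illustrative only.

```python
import math


def combined_bias_time(v, x, T, N):
    """Expanded form: T * max(0, (1 - v) + v * (1 + log_N x))."""
    return T * max(0.0, (1.0 - v) + v * (1.0 + math.log(x, N)))


def combined_bias_time_compact(v, x, T, N):
    """Compact form: T * max(0, 1 + log_N(x ** v))."""
    return T * max(0.0, 1.0 + math.log(x ** v, N))
```

For v = 1 both forms reduce to the unbiased timer T max(0, 1 + log_N x). For v = 0 they return T regardless of x, which is the degenerate case in which all receivers respond at the same time.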

However, the feedback suppression characteristics of the above mechanism 
still depend at least to some extent on the value distribution at the receivers. 
Some extreme cases such as v = 0 for all receivers will always result in a feedback 
implosion. A more conservative approach is to not combine bias and suppression 
but use a purely additive bias. 

Algorithm 6. (Feedback with Additive Bias): 

Apply the suppression algorithm described earlier, but modify the feedback time to 

$$t = T \max\bigl(0,\; \gamma\,(1 - v) + (1 - \gamma)(1 + \log_N x)\bigr) \qquad (9)$$

with $\gamma \in [0, 1]$. 

To retain the same upper bound on the maximum feedback delay, it is nec- 
essary to split up T and use a fraction of T to spread out the feedback with 
respect to the response values and the other fraction for the exponential timer 
component. As long as $(1 - \gamma)T$ is sufficiently large compared to the network 
latency τ, an implosion as in the above example is no longer possible. 
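A minimal sketch of Equation 9 follows. As before, x is assumed to be drawn uniformly from (0, 1] and N is the base of the logarithm; the function name and the default of γ = 1/4 (the value used in the simulations of this section) are choices made for the example.

```python
import math


def additive_bias_time(v, x, T, N, gamma=0.25):
    """Return T * max(0, gamma * (1 - v) + (1 - gamma) * (1 + log_N x)).

    A fraction gamma of T spreads the responses by value; the remaining
    fraction (1 - gamma) of T is left for exponential suppression.
    """
    value_part = gamma * (1.0 - v)
    random_part = (1.0 - gamma) * (1.0 + math.log(x, N))
    return T * max(0.0, value_part + random_part)
```

In contrast to the combined bias, the timers remain spread out even if every receiver reports v = 0: with T = 4 and N = 10,000, x = 1 gives t = 4 while x = 0.01 gives t = 2.5.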

To better demonstrate the characteristics of these modifications, Figures 5 to 
7 show how the feedback time changes with respect to response values compared 
to normal unbiased feedback. A single set of random 
variables was used for all the simulations to allow a direct comparison of the 
results. For the simulations, the parameters N and n were set to 10,000 and 
2,000 respectively.* In these simulations we do not consider maximum search 
but only how feedback biasing affects the distribution of feedback timers. Thus, 
to isolate the effect of feedback biasing, only a single feedback class was used 
such that the first cancelation notification suppresses all subsequent feedback. 
All simulations were carried out with T = 4τ as well as T = 8τ to demonstrate 
the impact of the feedback delay on the number of feedback responses. Each 
graph shows the feedback times in units of τ of the receiver set along the x-axis and 
the corresponding response values on the y-axis for each of the three feedback 
mechanisms: no bias, combined bias (Algorithm 5), and additive 
bias (Algorithm 6) with γ = 1/4. Suppressed feedback messages are marked with 
a dot, feedback that is sent is marked with a cross, and the black square indicates 
which of these feedback messages had a value closest to the actual optimum of 
the receiver set. 

* Note that using n = N = 10,000 instead of n = 2,000 would reduce the probability 
of an implosion since the probability that one early responder suppresses all others 
increases. 

Fig. 5. Feedback Time and Value (Uniform Distribution of Values). 

Fig. 6. Feedback Time and Value (Exponential Distribution of Values). 

Fig. 7. Feedback Time and Value (Truncated Uniform Distribution of Values). 
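The single-class suppression model described for these simulations can be sketched as follows. The sketch assumes a uniform latency τ between the first response being sent and its cancelation notice reaching all other receivers, so that every receiver whose timer expires less than τ after the earliest timer still responds. The function names, the default seed and the treatment of the boundary case are assumptions; this is not the simulator used for the figures.

```python
import math
import random


def unbiased_time(x, T, N):
    """Unbiased exponential timer: T * max(0, 1 + log_N x)."""
    return T * max(0.0, 1.0 + math.log(x, N))


def count_responses(timers, tau):
    """Number of receivers that respond before the first notice arrives."""
    first = min(timers)
    return sum(1 for t in timers if t < first + tau)


def simulate_unbiased(n=2000, N=10_000, tau=1.0, T=4.0, seed=1):
    """Draw n unbiased timers and count the responses that are sent."""
    rng = random.Random(seed)
    # 1.0 - rng.random() lies in (0, 1], which keeps the logarithm finite.
    timers = [unbiased_time(1.0 - rng.random(), T, N) for _ in range(n)]
    return count_responses(timers, tau)
```

Replacing `unbiased_time` with a biased timer and supplying one response value per receiver gives the corresponding counts for the combined and additive variants.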

In the graphs in Figure 5 the response values of the receivers are uniformly 
distributed. When no feedback bias is used, the first response that suppresses the 
other responses is random in value. In contrast, both feedback biasing methods 
result in the best reported feedback value being very close to the actual optimum. 
The number of sent feedback messages is higher with the two biasing methods 
since a smaller fraction of T is used for feedback suppression. Naturally, the 
number of feedback messages also increases when T is smaller as depicted in the 
right graph. 

In Figure 6 the same simulations were carried out for an exponential distribution 
of response values with a high probability of being close to the optimum. 
(When a reversed exponential distribution with most values far from the op- 
timum is used, the few good values suppress all other feedback and again a 
feedback implosion is always prevented.) As can be seen from the graph, feed- 
back suppression works well even when the actual distribution of response values 
is no longer uniform. For a uniform as well as an exponential distribution of re- 
sponse values, the combined bias suppression method results in fewer feedback 
messages while maintaining the same feedback quality. 

However, as mentioned before, combining bias and suppression permits a 
feedback implosion when the range of feedback values is smaller than anticipated. 
In this case, the bias results in an unnecessary delaying of feedback messages, 
thus reducing the time that can be used for feedback suppression. In Figure 7 
the response values are distributed uniformly in [0; 0.25] instead of [0; 1]. For 
T = 4τ, the time left for feedback suppression is τ, resulting in a scenario 
where no suppression is possible and each receiver will send feedback. Even 
when T = 8τ and thus a time of 2τ can be used for the feedback suppression, 
the number of feedback messages is considerably larger than in simulations with 
an additive bias. The exact numbers for the feedback responses of the three 
methods are given in Table 2. 



Table 2. Number of Responses with the Different Biasing Methods. 

Feedback Time       No Bias   Additive   Combined
T=4, Uniform              5         19         15
T=4, Exponential          5         14         12
T=4, Truncated            5         14       2000
T=8, Uniform              2          6          4
T=8, Exponential          2          2          2
T=8, Truncated            2          2

For suppression to be effective, the amount of time reserved for the exponential 
distribution of the feedback timers should not be smaller than 2τ. Thus, 
the feedback implosion with Algorithm 5 can be prevented by bounding v such 
that vT ≥ 2τ (i.e., using v' = max(v; 2τ/T) instead of v in Equation 8). In the 
worst case, the distribution of the feedback timers is then similar to an unbiased 
exponential distribution with T = 2τ. A higher bound can be used to 
reduce the expected number of feedback messages in the worst case. The same 
considerations hold for the choice of the value of γ for the additive feedback bias. 
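The bounded variant of the combined bias can be sketched in a few lines. The function name is illustrative, x is again assumed to be uniform in (0, 1], and τ and T must be given in the same time unit.

```python
import math


def bounded_combined_bias_time(v, x, T, N, tau):
    """Combined-bias timer with v replaced by v' = max(v, 2 * tau / T).

    The bound keeps at least 2 * tau of the interval T available for
    exponential suppression, even when all reported values are small.
    """
    v_bounded = max(v, 2.0 * tau / T)
    return T * max(0.0, 1.0 + v_bounded * math.log(x, N))
```

With T = 8τ the bound is v' ≥ 0.25, so a receiver with v = 0 and x = 1/N now responds at 0.75 T rather than at T. Values above the bound are unaffected.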

The outcome of a single experiment is not very representative since the num- 
ber of feedback messages is extremely dependent on the feedback values of the 
early responders. As for the previously discussed feedback mechanisms, we de- 
pict the number of feedback messages for combined and additive bias averaged 
over 200 simulations in Figure 8. 




Fig. 8. Number of Responses with Feedback Biasing (x-axis: Number of Receivers). 



The main advantage of the feedback bias is that the expected response value 
for early responses is improved. This not only reduces the time until close to op- 
timal feedback is received (with unbiased feedback and class-based suppression, 
close to optimal feedback is likely to arrive at the end of a feedback round) but 
also reduces the number of responses with less optimal feedback. 

Figure 9 shows how the feedback quality improves compared to the normal 
exponential feedback suppression when biasing the feedback timer. The maxi- 
mum deviation is reduced from about 15% to 6% for additive bias and to less 
than 2% for combined bias. 

While a similar increase in feedback quality can be achieved by using feedback 
classes (at the expense of an increased number of feedback messages), only with a 
feedback bias is it possible to improve the quality of the first feedback message. 
In case a close to optimal value is needed very quickly, using either Algorithm 5 
or Algorithm 6 can be beneficial. Figure 10 depicts the average deviation of 
the value of the first feedback message from the optimum. Here, the increase 
in quality is much more obvious than in the previous case. With all unbiased 
feedback mechanisms, the first reported value is random and thus the average 
deviation is 50% (for large enough n), whereas the combined and the additive 
biased feedback mechanisms achieve average deviation values around 10% and 
30% respectively. 

Fig. 9. Deviation of Best Response Value from Optimum. 

Fig. 10. Deviation of First Response Value from Optimum. 



Lastly, the expected delay until the first feedback message is received is of 
concern. While all mechanisms adhere to the upper bound of T, feedback can be 
expected earlier in most cases. In Figure 11 we show the average feedback delay 
for biased and unbiased feedback mechanisms. For all algorithms the feedback 
delay decreases logarithmically for an increasing number of receivers. The exact 
shape of the feedback curve depends on the amount of time used for suppression. 
For this reason, unbiased feedback delay drops faster than biased feedback delay, since 
a bias can only delay feedback messages compared to unbiased feedback. In case 
the number of receivers is estimated correctly (i.e., n = N), the feedback delay 
for unbiased feedback drops to τ, the minimum delay possible for such a feedback 
system. Biased feedback delay is slightly higher with approximately 1.5τ. 
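The logarithmic decrease of the delay for unbiased feedback can be reproduced with a small Monte Carlo sketch. It estimates the average value of the earliest timer among n receivers; the network latency τ has to be added to obtain the delay seen by the sender. The function name, trial count and seed are assumptions, and the sketch is not the simulation behind Figure 11.

```python
import math
import random


def mean_first_timer(n, N=10_000, T=4.0, trials=200, seed=0):
    """Average earliest unbiased timer T * max(0, 1 + log_N x) over n receivers."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(trials):
        # Only the smallest random number matters; draws lie in (0, 1].
        x_min = min(1.0 - rng.random() for _ in range(n))
        total += T * max(0.0, 1.0 + math.log(x_min, N))
    return total / trials
```

Evaluating the function for n = 10, 100 and 1,000 with N = 10,000 gives a decreasing sequence in which each tenfold increase of n lowers the average timer by a roughly constant amount.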




Fig. 11. Average Feedback Delay (x-axis: Number of Receivers). 



6 Conclusions 

In this paper we presented mechanisms that improve upon the well-known con- 
cept of exponential feedback suppression in case feedback of some extreme value 
of the group is needed. We discuss two orthogonal methods to improve the qual- 
ity of the feedback given. If no information is available about the distribution 
of the values at the receivers, a safe method to obtain better feedback is to 
modify the suppression mechanism to allow the sending of high valued feed- 
back even after a receiver is notified that some feedback was already given. We 
give exact bounds for the expected increase in feedback messages for a given 
improvement in feedback quality. If more information about the distribution of 
feedback values is available or certain worst-case distributions are very unlikely, 
it is furthermore possible to bias the feedback timer. The better the feedback 
value the earlier the feedback is sent, thus suppressing later feedback with less 
optimal values. The modified suppression mechanism and the feedback biasing 
can be used in combination to further improve the feedback process. 

The mechanisms discussed in this paper have been included in the TCP- 
friendly Multicast Congestion Control Protocol (TFMCC) Plj. It uses class- 
based feedback cancelation as well as feedback biasing to determine the current 
limiting receiver (i.e., the receiver with the lowest expected throughput of the 
multicast group). The protocol depends on short feedback delays in order to 
quickly respond to congestion. Selecting the correct receiver as current limit- 
ing receiver is critical for the functioning of the protocol since a wrong choice 
may compromise the TCP-friendly characteristics of TFMCC. In that sense, the 
feedback mechanism is an important part of the TFMCC protocol. 
Extremum feedback is not yet included in any other application, but we 
believe a number of applications can benefit from it. 

6.1 Future Work 

In the future we would like to continue this work in several directions. 

Most applications need to consider only one type of feedback value. Never- 
theless, it may sometimes be useful to get multivalued feedback, for example to 
monitor some critical parameters of a large network, where changes in each of 
the parameters are equally important. It may not always be possible to aggregate 
different types of values to one single “ranking” value. In this case, a multivalued 
feedback mechanism clearly has better suppression characteristics than separate 
feedback mechanisms for each of the relevant values. 

Another important step will be the combination of knowledge about the 
value distribution within the responder group with implosion avoidance features. 
Several mechanisms to estimate the size of the receiver set from the feedback time 
and the number of feedback messages with exponential feedback timers have been 
proposed . Combining such estimation methods with extremum feedback, it 
should be possible to estimate the distribution of response values at the receivers 
in case this distribution is not known. For continuous feedback, this knowledge 
can then be used to generate feedback mechanisms based on Algorithm 4. In 
scenarios where the distribution of response values is not uniform, we expect 
that such an approach will outperform the biasing mechanisms presented in 
Section 5, which do not take the distribution into account. 

Taking these considerations one step further, in some cases the maximum 
change of the relevant state during one feedback round is bounded. For example, 
in the case of TFMCC, the measurements to determine round-trip time and loss 
event rate are subject to smoothing, thus limiting the maximum rate increase 
and decrease per round-trip time. In case some information about the previous 
distribution of feedback values is available (e.g., from the previous feedback 
round), it is possible to infer the worst case distribution of the current feedback 
round. This allows the feedback algorithm to be further improved by tailoring it to 
the specific distribution. 
Abstract. TBCP is a generic Tree Building Control Protocol designed 
to build overlay spanning trees among participants of a multicast session, 
without any specific help from the network routers. TBCP therefore falls 
into the general category of protocols and mechanisms often referred to 
as Application- Level Multicasting. TBCP is an efficient, distributed pro- 
tocol that operates with partial knowledge of the group membership 
and restricted network topology information. One of the major strate- 
gies in TBCP is to reduce convergence time by building as good a tree 
as possible early on, given the restricted membership/topology informa- 
tion available at the different nodes of the tree. We analyse our TBCP 
protocol by means of simulations, which show its suitability for purpose. 



1 Introduction 

It is well known that the research community has proposed several basic models 
and a plethora of supporting protocols for multicast in the Internet, none of 
which has been ultimately deployed on the very large scale of the whole Inter- 
net Pj. Making multicast deployment a more difficult task is the fact that even 
the network layer service of the Internet is in the process of being (at least par- 
tially) updated, due to the introduction of a new version of the Internet Protocol 
(IPv6). 

Both versions of the IP protocol support multicast transmission of datagrams 
to several receivers [2], i.e. they reserve a set of addresses to identify 
groups of receivers and they assume that network routers are able to replicate 
packets and deliver them to all entities registered as members of a group. To 
carry out such a job, a number of complementary control and routing protocols 
must be deployed both in the end systems and in the network routers. Such 
deployment of new software in the network routers (i.e. changes to the network 
infrastructure) is a major hurdle to ubiquitous deployment, and is responsible 
for very long rollout delays. 

This situation has led the research community to propose mechanisms and 
protocols to build overlay spanning trees among hosts in a multicast session 
(and possibly service nodes, which are hosts placed strategically in the network 
to facilitate and optimise the construction of these overlays). Obviously, the 
distribution of multicast data with such overlay spanning trees is less efficient 
than in native IP multicast mode, but this relative performance penalty must 
be contrasted with the ease and speed of deployment offered by the overlay 
technique. 

* This work was supported by the EC IST GCAP (Global Communication Architecture 
and Protocols) Project. 

Until recently, the main paradigm enabling multicast in the Internet was the 
dual concept of group and associated group address. However, these concepts 
alone lead to an open service model for IP multicast where anybody can send to 
a group and that therefore presents serious scalability, security and deployment 
concerns 0. This is why a restricted Internet multicast model called Source- 
Specific Multicast (SSM) has been proposed 00. SSM is based on the concept 
of multicast channel (i.e. a pair (source address, multicast address)) and the fact 
that only the channel source can send on the channel. This new model for mul- 
ticast service, because of its simplicity and elegance, has attracted widespread 
support within both the academic and industrial communities. 

Unfortunately, this SSM model breaks many of the mechanisms and protocols 
that have been proposed for reliable multicast communications (e.g. 0 ), due to 
the lack of a multicast backchannel from the receivers to the rest of the multicast 
group. It should be noted that such a problem is not specific to an SSM multicast 
model, as it can also manifest itself on asymmetric satellite links with terrestrial 
returns, some firewall configurations and some proprietary multicast deployment 
schemes (e.g. UUnet multicast deployment). 

This observation, along with the fact that Tree-based ACKs (TRACKS) PE| 
psimi appear to be an excellent candidate for scalable, (near) real-time reliable 
multicast communications, argues for overlay multicast spanning trees to be used 
for control purposes in a reliable multicast scenario. 

In this paper, after briefly reviewing techniques that have been proposed 
to build both multicast data and control trees, we present our Tree Building 
Control Protocol (TBCP) which is capable of efficiently building both data and 
control trees in any network environment. Finally, we analyse the performance 
and characteristics of the overlay trees built by TBCP. 

2 Related Work 

The concept of tree has proved to be a particularly well suited and efficient 
dissemination structure for carrying both data and control information in group 
communications . 

Even if, and when, IP multicast becomes widely deployed, the need for tree- 
based control structures will remain in a reliable multicast scenario. The nodes 
of such control trees must understand and take part in the control protocols 
used. These nodes can either be the end-hosts (sender and receivers), service 
nodes, network routers implementing the appropriate control functions, or any 
combination of these. 

The best way to build control trees that are congruent to the IP multicast 
data distribution tree is, of course, to support the construction of the control tree 
within the network. Pretty Good Multicast (PGM) ^ builds control trees based 
on the concept of reverse path. PGM control trees are not overlay trees because 
internal tree nodes are PGM routers. For that reason, PGM is very efficient. 

In [II I j , overlay trees are built based on positional reachcasting. Reachcasting 
consists of sending packets to the nearest member host in any direction of a 
multicast routing tree and relies on multicast routers supporting the notion of 
nearest-host routing. The overlay trees built with this technique are optimal. 

Unfortunately, the above mentioned tree building techniques require modifi- 
cations to the network infrastructure and are yet to be widely deployed. 

Techniques to build overlay trees without relying on special router features, 
apart from multicast routing, have also been proposed. TMTP m uses expand- 
ing ring searches (ERS) to discover tree neighbours. However, ERS does not 
perform well in asymmetrical networks, so that the notion of unsolicited neigh- 
bour announcements has been introduced in TRAM jin]- TMTP and TRAM 
both require multicast support in the network and will not operate in an SSM 
context. 

More recently, techniques for building overlay spanning trees without de- 
pending on multicast routing support in the network have been proposed. We 
do not attempt an exhaustive review of this area, but rather try to present the 
most significant proposals. 

ALMI j I .'tj is representative of centralised techniques, where a session con- 
troller has total knowledge of group membership as well as of the characteristics 
of a mesh connecting the members (this mesh improves over time). Based 
on this knowledge, the session controller builds a spanning tree whose topology 
is distributed to all the members. However, as centralised approaches do not scale, 
distributed techniques are required to support groups larger than a few tens of 
members. 

In Yoid jl| , a prospective receiver learns about some tree nodes from a rendez- 
vous point and chooses one of them as its parent. The place in the tree where 
a new member joins may not be optimal and a distributed tree management 
protocol is employed to improve the tree over time. However, for scalability 
reasons, the convergence time to optimality can be rather slow. Yoid hosts also 
maintain an independent mesh for robustness. 

In Narada Q, the hosts first build a random mesh between themselves, then 
a (reverse) shortest path spanning the mesh. For robustness, each host gains full 
knowledge of the group membership (Narada is targeted towards small multicast 
groups) through gossiping among mesh neighbours, and this knowledge is used 
to slowly improve the quality of the mesh. 

Overcast ^j is another approach aimed at an infrastructure overlay. It is 
specifically targeted towards bandwidth efficient overlay trees. Overcast is a 
tree-first approach building unconstrained trees (i.e. tree nodes do not limit 
the number of children they serve). 
3 Tree Building Control Protocol (TBCP) 

Building an overlay spanning tree among hosts presents a significant challenge. 
To be deployable in practice, the process of building the tree should not rely 
on any network support that is not already ubiquitous. Furthermore, to en- 
sure fast and easy deployment, we believe the method should avoid having to 
rely on “its own infrastructure” (e.g. servers, “application-level” routing pro- 
tocols, etc.), since the acceptance and deployment of such infrastructure could 
hamper the deployment of the protocol relying on it. Consequently, one of our 
major requirements is to design a tree building control protocol that operates 
between end-systems exclusively¹ and considers the network as a “black-box”. 
End-systems can only gain “knowledge” about the network through host-to-host 
measurement samples. 

From a scalability point of view, it is also unrealistic to design a method 
where pre-requisite knowledge of the full group membership is needed before the 
spanning tree can be computed. This is especially true in the context of multicast 
sessions with dynamic group membership. For robustness and efficiency, as well 
as scalability, it is also preferable to avoid making use of a centralised algorithm. 

Our goal is therefore to design a distributed method which builds a span- 
ning tree (which we also call the TBCP tree) among the hosts of a multicast 
session without requiring any particular interactions with network routers nor 
any knowledge of the network topology, and which does so with only partial 
knowledge of the group membership. 

TBCP is a tree-first, distributed overlay spanning tree building protocol, 
whose strategy is to place members in a (near) optimal position at joining time. 

Compared with the proposals discussed in the previous section, TBCP can 
be viewed as complementing Yoid, in the sense that TBCP could advantageously 
replace the rather “random bootstrap” approach in Yoid. Unlike Narada, TBCP 
does not require full membership knowledge and is not a mesh-first protocol. 
Also, because of the fundamental strategy in TBCP, tree convergence time should 
be much faster than in Yoid and Narada. 

The general approach to building the tree in TBCP is similar to the one 
followed in Overcast. However, TBCP is not restricted to bandwidth optimised 
trees and, maybe more importantly, TBCP trees are explicitly constrained (each 
host fixes an upper-limit on the number of children it is willing to support). 



3.1 TBCP Join Procedure 

In TBCP, the root of the spanning tree (e.g. main sender to the group) is used as 
a rendez-vous point for the associated TBCP tree, that is, new nodes “join” the 
tree at the root. Hence a TBCP tree can be identified by the (S, SP) pair, where 
S is the IP address of the TBCP tree root, and SP the port number used by 
the root for TBCP signalling operations. This information is the only advertised 
information needed to join the TBCP tree. 

¹ We want to emphasise that infrastructure nodes are not considered as a pre-requisite 
in our approach. However, should such nodes exist, TBCP could, of course, be used 
to interconnect them. 

Each TBCP entity (including the tree root) fixes the maximum number of 
“children” it is willing to accommodate in the spanning tree. This value, called 
the fanout of the entity, allows the entity to control the load of traffic flowing on 
the TBCP tree that it will handle. 

Fig. 1. TBCP Join Procedure Messages. 

The TBCP Join procedure is a “recursive” mechanism and works as follows: 



1. A newcomer N contacts a candidate parent P with a HELLO message, starting 
with the tree root S (figure 1.(a)). 

2. P sends back to N the list of its existing children C_i in a HELLO_ACK 
message (figure 1.(b)), starts a timer T_0 and waits for a JOIN message from 
N (figure 1.(c)). This timer is needed because, for consistency reasons, P 
cannot accept any new HELLO message until it has finished dealing with 
the current newcomer. 

3. N estimates its “distance” (i.e. takes measurement samples) from P and all 
C_i and sends this information to P in a JOIN message (figure 1.(c)). Note 
that if P has not received this JOIN message within its timer T_0, P sends a 
RESET message to N, meaning that N needs to restart the procedure from 
stage 1. 

4. P finds a place for N by evaluating all possible “local” configurations having 
P as parent and involving C_i ∪ {N} (see figure 2). A score function is used to 
evaluate “how good” each configuration is, based on the distance estimates 
among P, N and the C_i. 

5. Depending on which configuration P has chosen, the following occurs: 

(a) if N is accepted as a child of P, P sends a WELCOME message to N, 
which N acknowledges immediately (figures 1.(d) and 1.(e)). The join 
procedure is then completed for N. 

(b) if either N or any of P’s children (say C_j) is to be redirected to one of 
P’s children (say C_k), that node is sent a GO(C_k) message (figure 1.(f)) 
and starts a join procedure (stage 1) with C_k assuming the role of P and 
C_j the one of N. When N receives such a message, it acknowledges this 
immediately with a GO_ACK message (figure 1.(g)). However, in order 
not to disrupt the flow of data for an already established node, C_j is 
given a time T_1 to find its new place in the tree. 




Fig. 2. Local Configurations Tested (panel label: “If fanout of P > 4”). 



Notice that the Join procedure is not performed exclusively by TBCP entities that 
have not yet joined the tree: even a TBCP entity that has already joined 
the tree may be forced to start a Join procedure to find a new place in the 
tree. It should also be noted that the algorithm always terminates, as GO messages 
always send a TBCP entity (and the associated TBCP subtree whose root is the 
corresponding TBCP entity) down the TBCP tree. 
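The recursive descent of the Join procedure can be illustrated with a heavily simplified sketch. It is not the TBCP algorithm itself: message exchange, timers and the score function are omitted, existing children are never redirected, and the only local decisions modelled are "accept N if P has a free slot" and "otherwise send N to the closest child". The names `Node`, `join` and `distance` are hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    name: str
    fanout: int  # maximum number of children this node accepts
    children: list = field(default_factory=list)


def join(root, newcomer, distance):
    """Place newcomer in the tree below root and return its parent.

    distance(a, b) stands in for the host-to-host measurement samples.
    """
    parent = root
    while True:
        if len(parent.children) < parent.fanout:
            parent.children.append(newcomer)  # corresponds to WELCOME
            return parent
        if not parent.children:
            raise ValueError("a node with fanout 0 cannot accept children")
        # Corresponds to GO: continue one level further down the tree.
        parent = min(parent.children, key=lambda c: distance(newcomer, c))
```

Because every redirection moves the newcomer one level down a finite tree, the loop ends, which mirrors the termination argument given above.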



Additional Rules for Tree Construction. In order to improve the efficiency 
(and the shape) of the control tree, a hierarchical organization of receivers in 
“domains” is also enforced, so that receivers belonging to the same domain 
belong to the same sub-tree. 

Receivers declare their domainID when connecting to a new candidate parent 
(the domainID is a 32-bit identifier, e.g. IP address & netmask). The tree 
root can then elect a domain root node for each domain. For instance, the first 
node joining from a given domain can be elected as the domain root of its 
domain. Domain root nodes find their place in the tree with the same mechanism 
described in the previous section, starting from the tree root. 

When a new node wants to join the tree, it is immediately redirected by the 
tree root to its domain’s domain root with a GO message. The Join procedure 
described in the previous section then starts when the node sends the HELLO 
message to its domain root. 
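One way to derive such a 32-bit domainID from an IPv4 address and netmask is sketched below using Python's standard `ipaddress` module. The function name and the example addresses are assumptions; the text only gives "IP address & netmask" as one possible choice of identifier.

```python
import ipaddress


def domain_id(ip: str, netmask: str) -> int:
    """Return the 32-bit value of the IPv4 address masked with the netmask."""
    return int(ipaddress.IPv4Address(ip)) & int(ipaddress.IPv4Address(netmask))
```

Two receivers in the same subnet obtain the same identifier and would therefore be redirected to the same domain root, while a receiver in another subnet obtains a different one.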

The following 2 constraints are also enforced, to keep all the nodes of the 
same domain “clustered”: 
1. Rule 1: A node P will discard configurations in which a node from its own
domain is a child of a node from a different domain. 

2. Rule 2: To keep domain roots as high as possible in the tree (i.e. as close 
as possible to the tree root), configurations in which a node P keeps more 
than one node from its own domain as children, and sends a domain root of 
a different domain as a child of one of its children, are discarded. 

Figure 3 illustrates discarded configurations, where domains are represented by
“colour codes”.




3.(a): Rule 1.    3.(b): Rule 2.

Fig. 3. Discarded Configuration Due to Rule Violation.



3.2 Cost Function and Measurements 

In section 3.1 we have seen that a score is computed for each “local”
configuration involving a parent node, its children and a “newcomer”. These scores are
obtained by applying to the local configurations a score function whose inputs 
are the distances among the nodes involved and/or any other relevant “metric” 
(e.g. domainIDs, node fanouts, etc.). Because the score function is used in each 
local decision taken by the TBCP nodes in the tree, the score function influences 
the final (i.e. global) shape of the tree. Of course, what constitutes a good shape 
for an overlay tree depends on the purpose for which the tree is built. Likewise, 
the notion of “distance” between tree nodes, used as input to the score func- 
tion, can be represented by any metric (or set of metrics) of interest (e.g. delay, 
throughput, etc.) depending on the problem at hand. 

Since score functions and distance evaluation mechanisms depend on the 
purpose of the tree, for flexibility, they are considered as “modules” rather than 
part of the core TBCP protocol. 

For example, for an overlay used for reliable multicast, the following could 
be selected: 

- distance $D(i,j)$ = round trip time (RTT) between node $i$ and node $j$ along
the tree. Hence, if node $i$ is the parent of node $k$ that, in turn, is the
parent of node $j$, then $D(i,j) = D(i,k) + D(k,j)$;

- score function = $\max_{M \in \{C_i\} \cup \{N\}} D(P, M)$, where $\{C_i\}$ is
the set of P’s children and $N$ the newcomer.

The chosen configuration is the one with the smallest score. We therefore see 
that our strategy is to try and maximise the “control responsiveness” between 
nodes. 
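The selection rule above can be sketched as follows. This is only an illustrative sketch under stated assumptions: the enumeration of candidate configurations (everyone adopted by P if the fanout allows, or exactly one node redirected below one of P's children), the names `tree_distance`, `score` and `best_configuration`, and the RTT values are all hypothetical, and the domain rules are ignored.

```python
def tree_distance(parent_of, rtt, p, m):
    """Distance D(p, m): sum of RTTs along the tree path from m up to p."""
    d = 0.0
    while m != p:
        d += rtt[frozenset((parent_of[m], m))]
        m = parent_of[m]
    return d


def score(parent_of, rtt, p, members):
    """Score of a local configuration: the largest D(p, M) over all members M."""
    return max(tree_distance(parent_of, rtt, p, m) for m in members)


def best_configuration(p, children, newcomer, rtt, max_fanout):
    """Return the candidate configuration (node -> parent) with the smallest score."""
    members = list(children) + [newcomer]
    configs = []
    if len(members) <= max_fanout:
        configs.append({m: p for m in members})  # P adopts the newcomer directly
    for moved in members:                        # one node is redirected ...
        for host in children:                    # ... below one of P's children
            if host != moved:
                cfg = {m: p for m in members}
                cfg[moved] = host
                configs.append(cfg)
    return min(configs, key=lambda cfg: score(cfg, rtt, p, members))


# Hypothetical RTTs (ms) between P, its children A and B, and the newcomer N.
raw = {("P", "A"): 10, ("P", "B"): 10, ("P", "N"): 50,
       ("A", "N"): 5, ("B", "N"): 30, ("A", "B"): 40}
rtt = {frozenset(pair): value for pair, value in raw.items()}
best = best_configuration("P", ["A", "B"], "N", rtt, max_fanout=2)
print(best)
```

With these numbers the newcomer ends up below A, since that keeps the largest distance from P at 15 ms.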

4 Simulation Results 

We have simulated (with NS-2) our algorithm, with the score function
proposed in section 3.2, in order to evaluate their suitability for purpose. At this
stage, because we are mainly concerned with the characteristics of our trees,
no background traffic is used and hop counts can therefore be used as the distance
metric instead of RTTs.

We have used topologies composed of: 

— a core network of 25 routers, with each router connected to any other with 
a random probability of 10%; 

— 15 stub networks of 10 routers each, with each router connected to any other 
router in the same stub with a random probability of 5%; 

— each stub network is connected to the core by a single link, with each stub 
connected to a different core router. 
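A topology of the kind listed above could be generated along the following lines. This is a sketch only: the function name `build_topology`, the router naming scheme and the seed are assumptions, and the sketch does not check that the resulting random graph is connected, a point the description above does not address.

```python
import random
from itertools import combinations


def build_topology(seed=0, n_core=25, p_core=0.10,
                   n_stubs=15, stub_size=10, p_stub=0.05):
    """Build a random core network plus stub networks, each with a single uplink."""
    rng = random.Random(seed)
    edges = set()

    core = [f"core{i}" for i in range(n_core)]
    for u, v in combinations(core, 2):
        if rng.random() < p_core:          # core routers linked with probability p_core
            edges.add((u, v))

    gateways = rng.sample(core, n_stubs)   # a different core router for each stub
    stubs = []
    for s, gateway in enumerate(gateways):
        routers = [f"stub{s}_r{i}" for i in range(stub_size)]
        for u, v in combinations(routers, 2):
            if rng.random() < p_stub:      # routers within a stub linked with probability p_stub
                edges.add((u, v))
        edges.add((rng.choice(routers), gateway))  # the single stub-to-core link
        stubs.append(routers)
    return core, stubs, edges, gateways


core, stubs, edges, gateways = build_topology(seed=1)
print(len(core), len(stubs), len(edges))
```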

Furthermore, the nodes running our algorithm (root/receivers) are “hosts”
connected to stub routers only, with at most one receiver host connected to one stub
router. The hosts are distributed randomly in the stub networks, which determine
the domains. We therefore see that these topologies correspond to a worst-case
scenario, as the distance between any two receivers, and between any two stub
networks, is at least 3 hops and there is no tendency to “cluster” the
participating hosts. The results obtained should therefore indicate upper bounds.

We have simulated scenarios where all the TBCP entities have an identical
fanout of 2, 3, 4 and 5 respectively.

For each value of the fanout, groups of 3, 5, 10, 20, 30, 40, 50, 60, 70, 80, 90,
100 and 150 receivers have been incrementally tested on the same topology, that
is, 3 receivers joined the tree, then 2 were added, then 5, etc. This scenario was
repeated 10 times (with a different topology each time) for each fanout value.

Since, in the context of this paper, we are interested in the performance of 
the TBCP trees for control purposes, the mean and maximum distance measured 
between any receiver (i.e. node) in the tree and its parent, as well as the distance 
between any receiver and its domain root (see section 3.1), are of prime concern.
These are depicted in figure 4. In figure 4.(a) we see that as the size of the group
increases, the mean distance between nodes and their parents decreases. This is
because the larger the group is, the more “clustering” appears, and the algorithm 
is thus efficient at exploiting such clustering among receivers. The maximum 
distance observed between a node and its parent is due to the “interconnection” 
of different domains and depends on the topology. 
4.(a): Mean and Maximum Distance from Parent.    4.(b): Mean and Maximum Distance from Domain Root.

Fig. 4. Distances Measuring the “Locality” Efficiency of the TBCP Trees.



Figure 4.(b) shows that control operations confined within a domain would
see small response times, with a mean distance between any node in a domain
and its domain root (the root of the subtree covering the domain) increasing with
a small slope. The position of the domain root within the domain has of course
a great influence on the distances between nodes and their domain root,
especially the maximum distance. Because in these simulations the domain root
was chosen to be the first node of a domain to join the tree, and was therefore
randomly placed within a domain, the maximum values in figure 4.(b) are likely
to represent worst-case scenarios.

The mean and maximum distances measured between any receiver and the
tree root are depicted in figure 5. This figure represents the performance observed
when traversing the tree from root to leaves. The distance between the root and
the leaf nodes is less critical in the control scenario considered here than if the
overlay tree were to be used for data transfers. As expected, the values observed
stress the importance of the fanout in reducing overall delay along
the tree, with the mean delay for a fanout of 2 being (roughly) 1.5 times the
maximum delay for a fanout of 5.

Fig. 5. Mean and Maximum Distance from (Tree) Root.

The tree can also be “globally” characterized by investigating mean and
maximum delay ratios and link stresses. The delay ratio for a receiver is defined
as the ratio of (the delay from the root to this receiver along the tree) to (the
delay from the root to this receiver with “direct” (unicast) routing). The link
stress is simply a count of how many replicas of the same data appear on a
link, when that data is “multicast” along the TBCP tree. These are depicted in
figure 6. Figure 6.(a) shows that the distance between the tree root
and any node along the tree is on average about 2 to 4 times the direct distance
between the same nodes, which shows that the algorithm, and score function,
exhibit a tendency to grow the trees “in width”. The figure also shows that
the value of the fanout has a dramatic effect on the maximum delay penalty
observed on the trees.

6.(a): Mean and Maximum Delay Ratios.    6.(b): Maximum Link Stress.

Fig. 6. Global Characteristics of Tree.

Figure 6.(b) shows the expected reduction in load on network links when an
overlay tree is used as opposed to a “reflector” which uses unicast communica- 
tions from the root to each receiver. Indeed, in the reflector case, the maximum 
link stress is equal to the group size, as the link directly out of the root must 
carry all the data packets. 
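The two global metrics defined above (delay ratio and link stress) can be computed as in the sketch below. The tiny three-host network, the `route` table of physical paths and the function names are hypothetical and serve only to illustrate the definitions.

```python
from collections import Counter


def link_stress(overlay_edges, route):
    """Count the copies of one packet carried by each physical link.

    overlay_edges: (parent, child) pairs of the overlay tree.
    route: maps an overlay edge to the list of physical links it traverses.
    """
    stress = Counter()
    for edge in overlay_edges:
        for u, v in route[edge]:
            stress[frozenset((u, v))] += 1
    return stress


def delay_ratio(tree_delay, unicast_delay):
    """Delay from the root along the tree divided by the direct unicast delay."""
    return tree_delay / unicast_delay


# Hypothetical network: root, a and b all hang off a single router r1.
route = {
    ("root", "a"): [("root", "r1"), ("r1", "a")],
    ("a", "b"): [("a", "r1"), ("r1", "b")],
    ("root", "b"): [("root", "r1"), ("r1", "b")],
}
tree_stress = link_stress([("root", "a"), ("a", "b")], route)
reflector_stress = link_stress([("root", "a"), ("root", "b")], route)
ratio_b = delay_ratio(tree_delay=4, unicast_delay=2)  # hop counts used as delays
print(max(tree_stress.values()), reflector_stress[frozenset(("root", "r1"))], ratio_b)
```

In the reflector case the stress on the link out of the root equals the group size (2 here), as stated above, whereas the overlay tree keeps that link at a stress of 1.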

Finally, figures 7.(a) and 7.(b) illustrate the measurement sampling overhead
of the protocol (in terms of the number of nodes sampled) during the initial
joining period for a newcomer (i.e. from the moment the newcomer issues a JOIN
until it receives a WELCOME message). The overhead measured is therefore
proportional to the joining latency of a new node. Figure 7.(a) shows that the
maximum number of nodes sampled is kept well below the number of nodes in
the tree and is, indeed, very reasonable. It also shows that the average number of
measurement samples taken by newcomers increases sub-linearly with the
group size.

7.(a): Mean and Maximum Measurement Samples.    7.(b): Mean Measurement Samples in % of Population.

Fig. 7. Measurement Sampling Overhead

Figure 7.(b) shows the percentage of the already existing population of the 
tree that is sampled by a newcomer. The shape of these curves should not be 
any surprise. Indeed, only nodes that are siblings of the nodes constituting the 
branch between the root and the final place of the newcomer are sampled in 
addition to nodes in this branch. Therefore, the more branches that are added 
to the tree, the bigger the proportion of the tree population will be ignored by 
each move of the newcomer along its branch. 

The results presented in figure 7 show the scalability of the proposed tree
building mechanism.



5 Conclusions 

We have described an efficient and scalable TBCP algorithm/protocol to build 
such an overlay spanning tree. Our TBCP protocol does not rely on any special 
support from routers, nor on any knowledge of the network topology. 

Because our algorithm builds the tree through a series of local decisions
involving only a small subset of nodes for each decision, a joining node need
only get to know a few members of the multicast group. Also, because our
algorithm is decentralised, several new members can be in the process of joining
the tree simultaneously without incurring additional overhead, which is a good
scaling property for this type of protocol.

In the light of our simulation results, we believe our proposal constitutes a
good candidate for scalable overlay tree building. We have also designed a leave
procedure which, because of the lack of space, was not presented in this paper.
Abstract. Delivering popular web pages to the clients results in high
bandwidth and high load on the web servers. A method to overcome this 
problem is to send these pages, requested by many users, via multicast. 
In this paper, we provide an analytic criterion to determine which pages 
to multicast, and analyze the overall saving factor as compared with 
a unicast delivery. The analysis is based on the well known observation 
that page popularity follows a Zipf-like distribution. Interestingly, we can 
obtain closed-form analytical expressions for the saving factor, that show 
the multicast advantage as a function of the site hit-rate, the allowed 
latency and the Zipf parameter. 



1 Introduction 

One of the largest problems in the web is to deliver the content efficiently from
the site to the user. High load on the server and on the network leads to long
delays or, in extreme cases, to denial of service. Increasing the capacity for
delivering the content results in a high cost of extra servers and extra bandwidth.
Moreover, the capacity is planned to some value which, though larger than the
average load, almost always cannot accommodate the peak load. This is especially
true for popular pages, where the access pattern may be unpredictable and very
unstable (e.g. the famous Starr report case).

There are several methods to try to overcome the problem. One is to use
caches. However, caches are not effective for frequently changing content
or for long files (e.g., video, audio). A different possibility that we consider in this
paper is to use multicast, i.e., to deliver the content simultaneously to
many (all) users via a dynamic multicast tree. Obviously, one may also combine
both caching and multicasting to further improve the solution.

At first, it may seem that multicast could be effective only if many users
request exactly the same content at exactly the same time, which can occur
mainly in real time events. However, it is well known that one
can cyclically transmit a page by multicast until all users in the multicast tree
that requested the page receive it. Note that each user needs to receive one cycle
from the time that he joins the tree (which does not need to be the beginning of a
new cycle), assuming that there are no faults. A more efficient method that
overcomes possible packet losses can be achieved by using erasure codes.
The multicast advantage is manifested by combining overlapping requests
into a single transmission. This way the server load and bandwidth decrease
dramatically, since all overlapped users appear almost as a single user. Hence,
the most attractive pages (files) to multicast are pages that are popular, i.e.,
have many hits per second, and pages that are large. Fortunately, the access
pattern to pages of a site is far from being uniform. Any non-uniformity in
the distribution of the access pattern to pages enhances the advantage of using
multicast, since it results in more popular pages, which have higher concurrency.
It has been observed that indeed the access pattern for pages in a site
is highly non-uniform and obeys a Zipf-like distribution with a parameter $\alpha$
that is in the range of 1.4 to 1.6. With this distribution, a fixed number of pages
account for almost all requests for pages (say 95%). As in many other events, the
Zipf distribution occurs naturally, and so we assume that this is the request pattern
in order to obtain quantitative expressions for the multicast advantage. We will
present the results in terms of the Zipf parameter $\alpha$ and note that even for the
pure Zipf distribution, i.e., for $\alpha = 1$, and furthermore even for a Zipf-like
distribution with $\alpha < 1$, a small number of pages (maybe not as small as
for $\alpha > 1$) still account for most of the requests. Since a Zipf-like distribution
has a heavy tail, assuming such a distribution on the access pattern is one of the
weakest possible assumptions in terms of the advantage of multicast.

It is worthwhile to mention that the popular pages may change over time. An 
appropriate system that keeps track of the access pattern can easily maintain the 
list of the hot pages. Hence, such a system can decide which pages to multicast 
at each point in time according to the estimated parameters of the access rate 
and the size of the pages. 

We next discuss the results of this paper. We start, in section 2, with an
analysis of a site in which all the information regarding the access pattern and file
distribution is given. The analysis is based on a criterion we derive that
determines which pages to multicast. This criterion assumes that the page access rate
is given, or estimated, and it also depends on the allowable delay to receive the
page, which, in turn, determines the bandwidth at which the page is multicast.
The major result of our paper appears in section 3, and contains a set of
analytical expressions for the gain in bandwidth (and server load) in serving a typical
site by selective multicast (i.e., multicast of hot pages) as compared with the
standard unicast serving. For the typical site we assume that the access pattern
follows a Zipf-like distribution with some parameter $\alpha$. The overall bandwidth
saving factor achieved depends on the access rate to the site and the latency that
we allow for pages. Section 4 extends the analysis to a site with various typical
file groups. The paper is summarized in section 5.
2 Analysis for a Given Site

We use the following notation:

- $n$ is the number of pages in the site.

- $p_i$ is the probability of requesting page $i$, for $1 \le i \le n$, given that a
page was requested from the site.

- $S_i$ is the size of page $i$, in bits, for $1 \le i \le n$.

- $\lambda$ is the average access rate to the site, in hits per unit time. We note that
$\lambda = N\lambda_0$, where $N$ is the size of the population accessing the site and
$\lambda_0$ is the average access rate of a person from the population to the site.

As a step toward an analysis for a typical site, we analyze a
given site under the probably unrealistic assumption that all the above parameters
($n$, $p_i$, $S_i$, $\lambda$) are known. In this section we first compute the minimal required
bandwidth to serve this site by unicast. We then consider serving the site by
selective multicast, where we first determine which pages are worth multicasting
and then compute the resulting bandwidth. In this way we estimate the gain in
serving this site by multicast. Note that we assume that the site is planned to
be able to serve all requests and not to drop or block any of them.

2.1 Serving by Unicast 

Using the above notation, the average number of bits per unit time generated
in serving page $i$ is $\lambda p_i S_i$. Consider now

$$B_u = \sum_{i=1}^{n} \lambda p_i S_i = \lambda \sum_{i=1}^{n} p_i S_i .$$

This formula is the information theoretic lower bound on the required
bandwidth for serving all the pages by unicast, since the total average number of bits
requested per unit time must be equal (on the average) to the total number of
bits transmitted. Note that the lower bound is independent of the transmission
rate of the pages. Moreover, the above formula stands for the minimum possible
bandwidth in the ideal case where we can store the requests in a queue and
continuously output exactly the same number of bits per unit time, without any
bound on the latency encountered in delivering the files. The actual bandwidth required
by any practical system to support all requests (in particular, with bounded
latency) needs to be higher than this. Nevertheless, we essentially demonstrate the
multicast bandwidth advantage by showing that multicast requires less (sometimes
much less) bandwidth than this information theoretic bound.

2.2 Serving by Selective Multicast 

In serving a file $i$ by multicast, a carousel transmission (or better, a coded stream
using, e.g., the Bandwiz block-to-stream code) of the file is transmitted at some
particular bandwidth $W_i$ and all requests for the file are handled by receiving
from this multicast transmission. The bandwidth advantage in serving a file this
way comes from the fact that the file is served at the fixed bandwidth $W_i$ and
this bandwidth allocation is sufficient no matter how many requests the file has
during its transmission. In unicast, on the other hand, each request requires an
additional bandwidth allocation.

One may conclude that multicast can lead to an unbounded saving compared 
with unicast, simply by allocating a small bandwidth Wi to serve the file i. 
But there is a price for that. The latency in receiving the file, whose size is 
Si will become large. A reasonable multicast bandwidth allocation is such that 
the desired latency Li is guaranteed. Note that the information theoretic lower 
bound computed for unicast was independent of the latency we allow to deliver 
any file (although the realistic bandwidth, higher than that, does depend on it as 
discussed above). Thus, as the allowed latency is larger, the multicast advantage 
is larger. 

In view of this discussion, we assume that the bounds on the latencies for 
the various files are imposed on the system. We use the following definitions: 

- Let $L_i$ be the latency we allow for delivering page $i$ using multicast.

- Thus, $W_i = S_i / L_i$ is the rate that we choose to transmit page $i$.

We note that the values of $W_i$ and $L_i$ are functions of the typical capability of
the receivers and of network conditions. For example, $W_i$ should not be larger than
the typical modem rate if typical receivers access the site through a modem.
This implies that $L_i$ cannot be small for large files. Also, for small files it does
not pay to have a small $L_i$, since creating the connection from the receiver to the
site would dominate the delay. Hence we conclude that $L_i$ is never very small
and may be required to be reasonably large. As will be seen, the larger the $L_i$,
the better the multicast advantage.

Out of the bandwidth allocated to unicast, the portion of the minimal
bandwidth required to transmit file $i$ is $\lambda p_i S_i$ (which is the number of bits
per unit time requested of this file). Thus, in using multicast, we reduce the
bandwidth for all the pages for which

$$\lambda p_i S_i \ge W_i$$

and in this case we replace $\lambda p_i S_i$ by the bandwidth $W_i$. The above formula,
which provides the criterion for transmitting the file by multicast, is equivalent
to

$$\lambda p_i L_i \ge 1 .$$

Hence we conclude that the total bandwidth required by the selective
multicast is

$$B_m = \sum_{i \,:\, \lambda p_i L_i \ge 1} W_i + \sum_{i \,:\, \lambda p_i L_i < 1} \lambda p_i S_i .$$
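The criterion and the two bandwidth expressions of this section can be illustrated with a short sketch. The function names and the three-page site below are hypothetical; the numbers are chosen only to show one page falling on each side of the threshold.

```python
def unicast_bandwidth(lam, p, S):
    """Lower bound for unicast: lam * sum_i p_i * S_i (bits per unit time)."""
    return lam * sum(p_i * s_i for p_i, s_i in zip(p, S))


def selective_multicast_bandwidth(lam, p, S, L):
    """Multicast page i at rate S_i / L_i when lam * p_i * L_i >= 1, else unicast it."""
    total = 0.0
    multicast_pages = []
    for i, (p_i, s_i, l_i) in enumerate(zip(p, S, L)):
        if lam * p_i * l_i >= 1:
            total += s_i / l_i            # fixed carousel rate W_i
            multicast_pages.append(i)
        else:
            total += lam * p_i * s_i      # average unicast load of this page
    return total, multicast_pages


# Hypothetical site: 3 pages of 800,000 bits each, 10 hits/s, 4 s allowed latency.
lam = 10.0
p = [0.9, 0.09, 0.01]
S = [8e5, 8e5, 8e5]
L = [4.0, 4.0, 4.0]
b_u = unicast_bandwidth(lam, p, S)
b_m, chosen = selective_multicast_bandwidth(lam, p, S, L)
print(b_u, b_m, chosen)
```

Here the two most popular pages are multicast and the third stays on unicast, so the selective scheme needs roughly 0.48 Mbit/s against 8 Mbit/s for unicast.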
3 Analysis for a Typical Site

We consider a site where the various pages can be partitioned into typical
groups. In each group the pages are of similar characteristics, i.e. approximately
the same size and the same required latency for delivery to the user. For example,
one group can be text HTML files, another group can be pages with images
and yet another group can be audio or video files. We first consider one such
group of pages. It is well known and has been consistently observed that the
access pattern to the pages in a group is not uniform. In fact, the advantage of
multicast improves as the distribution becomes less uniform, since one needs to
multicast fewer pages to deliver the same fraction of the traffic. We make one of
the weakest possible assumptions on that distribution, i.e., a family of heavy tail
distributions on the access pattern. If the distribution is more skewed then the
saving by using multicast increases.

Assumption. Among a group of pages with the same latency, the popularity
of pages is distributed according to a Zipf-like distribution with some
parameter $\alpha > 0$. Specifically, the probability of the $i$'th most popular page is
proportional to $1/i^\alpha$, or equal to $\frac{1}{C(\alpha) i^\alpha}$ where
$C(\alpha) = \sum_{i=1}^{n} \frac{1}{i^\alpha}$.

The above assumption is crucial for our analysis. The parameter $\alpha$
which is usually observed for a typical site is in the range 1.4 to 1.6. In the sequel
we will use the following approximation:

$$\sum_{i=a+1}^{b} \frac{1}{i^\alpha} \approx \int_a^b \frac{1}{x^\alpha}\,dx .$$

In particular, $C(\alpha) \approx 1 + \int_1^n \frac{1}{x^\alpha}\,dx$.

Now, we are ready to continue the analysis. First we consider unicast. We
can approximate the expression

$$\lambda \sum_{i=1}^{n} p_i S_i$$

by

$$B_u = \lambda E(S)$$

where $E(S)$ is the expected size of a random page in the group.

Using the Zipf-like distribution we can evaluate the total bandwidth required
by multicast. Recall that it is worthwhile to multicast a page if $\lambda p_i L \ge 1$ ($L$ is
fixed for all pages in the group), and that we should multicast the most popular pages
regardless of their size. Let $k$ be the number of such pages that are worth
multicasting. Then $k$ is the largest integer that satisfies $\lambda p_k L \ge 1$, or

$$\frac{1}{C(\alpha) k^\alpha} \ge \frac{1}{\lambda L} .$$

Following the above formula there are three different cases that we need to
analyze, according to the value of $k$ that the above formula yields:

- No need to multicast any page. This is the case where the access rate is small
and the required latency is so short that it is not worthwhile to multicast even
the most popular page ($k < 1$). That corresponds to $\lambda L < C(\alpha)$.

- Multicast all pages. Here the access rates are high or the number of pages
is relatively small, such that it is worthwhile to multicast all pages ($k \ge n$).
Here all pages are popular, which corresponds to $\lambda L \ge C(\alpha) n^\alpha$.

- Multicast popular pages. This is the typical case where $1 \le k < n$ and we
multicast only the popular pages according to our metric. This corresponds
to $C(\alpha) \le \lambda L < C(\alpha) n^\alpha$.

Clearly, in the first case multicast saves nothing. Later we discuss the saving
when we multicast all pages. We begin, then, with the interesting case where
$1 \le k < n$, i.e., the case of multicasting only the popular pages.
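The three cases can be told apart numerically as in the sketch below, which uses the integral approximation of $C(\alpha)$ and solves the criterion $1/(C(\alpha)k^\alpha) \ge 1/(\lambda L)$ for $k$ without rounding. The function names and the sample parameters are assumptions made for illustration.

```python
import math


def zipf_constant(n, alpha):
    """Approximate C(alpha) = sum_{i<=n} i^-alpha by 1 + integral_1^n x^-alpha dx."""
    if alpha == 1:
        return 1 + math.log(n)
    return (n ** (1 - alpha) - alpha) / (1 - alpha)


def classify(lam, L, n, alpha):
    """Return (case, k): which pages are worth multicasting for a Zipf-like site."""
    c = zipf_constant(n, alpha)
    x = lam * L
    if x < c:
        return "none", 0.0
    if x >= c * n ** alpha:
        return "all", float(n)
    return "popular", (x / c) ** (1 / alpha)  # k solves 1 / (C k^alpha) = 1 / (lam L)


print(classify(200, 20, 10 ** 4, 1))     # typical case
print(classify(0.1, 4, 10 ** 4, 1))      # too little traffic
print(classify(10 ** 6, 20, 100, 1))     # every page is popular
```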



3.1 Multicasting the Popular Pages 



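In this case we get

$$k = \left(\frac{\lambda L}{C(\alpha)}\right)^{1/\alpha}$$

where $1 \le k < n$.

If we plug it into the formula for the total bandwidth of the multicast (i.e.
multicast the first $k$ pages and unicast the rest) we get

$$B_m = \sum_{i=1}^{k} S_i / L + \sum_{i=k+1}^{n} \lambda p_i S_i .$$

Since the pages in a group have similar characteristics in terms of size and
required latency, we can approximate the above by the following:

$$B_m \approx \frac{E(S)}{L}\,k + \frac{E(S)\lambda}{C}\int_k^n \frac{1}{x^\alpha}\,dx
= \frac{E(S)\lambda}{C}\left(\left(\frac{\lambda L}{C}\right)^{1/\alpha - 1}
+ \int_{(\lambda L / C)^{1/\alpha}}^{n} \frac{1}{x^\alpha}\,dx\right)$$

where we drop the integer value and we set

$$C = C(\alpha) = 1 + \int_1^n \frac{1}{x^\alpha}\,dx .$$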



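Next we separate the case $\alpha = 1$ from the case $\alpha \ne 1$. For the case
$\alpha \ne 1$ we also consider asymptotic behavior.

The Case $\alpha = 1$. Clearly

$$C = 1 + \int_1^n \frac{dx}{x} = 1 + \ln n - \ln 1 = \ln(en)$$

and

$$\int_{\lambda L / C}^{n} \frac{dx}{x} = \ln n - \ln\frac{\lambda L}{C}
= \ln\frac{nC}{\lambda L} = \ln\frac{n \ln(en)}{\lambda L} .$$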
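Hence for the range of the typical case, i.e., $\ln(en) \le \lambda L < n\ln(en)$, we have

$$B_m = \frac{E(S)\lambda}{\ln(en)}\left(1 + \ln\frac{n\ln(en)}{\lambda L}\right)
= E(S)\lambda\,\frac{\ln n + 1 - \ln\frac{\lambda L}{\ln(en)}}{\ln(en)}
= E(S)\lambda\,\frac{\ln(en) - \ln\frac{\lambda L}{\ln(en)}}{\ln(en)}
= E(S)\lambda\left(1 - \frac{\ln\frac{\lambda L}{\ln(en)}}{\ln(en)}\right) .$$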



If we compare it to standard unicast, the saving factor is

$$R = \frac{1}{1 - \dfrac{\ln\frac{\lambda L}{\ln(en)}}{\ln(en)}} .$$

Examples of the savings can be seen in the table of Fig. 1. Here $\lambda$ is given in
hits per second for the site (i.e. total rate for all pages), $L$ is given in seconds
(4 seconds for an HTML page, 20 seconds for a page with pictures and 300 seconds
for an audio or video clip) and $n$ is the number of pages of the site. Plots of $R$
appear in Figure 2 as a function of $\lambda$ (and also for various values of $\alpha$,
see also below).



$\lambda$    $L$    $n$       saving, $\alpha = 1$
200          20     $10^4$    2.41

Fig. 1. Examples of the Saving Factor for $\alpha = 1$.
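The saving factor for $\alpha = 1$ can be evaluated directly, as in the following sketch. The function name is hypothetical; the parameters ($\lambda = 200$, $L = 20$, $n = 10^4$) are taken from the example row of Fig. 1.

```python
import math


def saving_factor_alpha1(lam, L, n):
    """R = 1 / (1 - ln(lam*L / ln(e*n)) / ln(e*n)), valid for ln(en) <= lam*L < n*ln(en)."""
    c = math.log(math.e * n)          # C = ln(en) = 1 + ln n
    x = lam * L
    if not (c <= x < n * c):
        raise ValueError("lam * L is outside the typical range for alpha = 1")
    return 1 / (1 - math.log(x / c) / c)


r = saving_factor_alpha1(200, 20, 10 ** 4)
print(round(r, 2))  # 2.41
```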



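The Case $\alpha \ne 1$. In this case

$$C = 1 + \int_1^n \frac{dx}{x^\alpha} = 1 + \frac{n^{1-\alpha} - 1}{1 - \alpha}
= \frac{n^{1-\alpha} - \alpha}{1 - \alpha}$$

and

$$\int_{(\lambda L / C)^{1/\alpha}}^{n} \frac{dx}{x^\alpha}
= \frac{n^{1-\alpha} - (\lambda L / C)^{1/\alpha - 1}}{1 - \alpha} .$$

Hence for the range

$$\frac{n^{1-\alpha} - \alpha}{1 - \alpha} \le \lambda L < \frac{n^{\alpha}\,(n^{1-\alpha} - \alpha)}{1 - \alpha}$$

we have

$$B_m = \frac{E(S)\lambda}{C}\left(\left(\frac{\lambda L}{C}\right)^{1/\alpha - 1}
+ \frac{n^{1-\alpha} - (\lambda L / C)^{1/\alpha - 1}}{1 - \alpha}\right)
= \frac{E(S)\lambda}{C}\left(\frac{n^{1-\alpha}}{1 - \alpha}
- \frac{\alpha}{1 - \alpha}\left(\frac{\lambda L}{C}\right)^{1/\alpha - 1}\right)$$

$$= \frac{E(S)\lambda}{(1 - \alpha)\,C}\left(n^{1-\alpha} - \alpha\left(\frac{\lambda L}{C}\right)^{1/\alpha - 1}\right)
= \frac{E(S)\lambda}{n^{1-\alpha} - \alpha}\left(n^{1-\alpha}
- \alpha\left(\frac{\lambda L (1 - \alpha)}{n^{1-\alpha} - \alpha}\right)^{1/\alpha - 1}\right) .$$

We conclude that the saving factor compared with unicast is

$$R = \frac{n^{1-\alpha} - \alpha}{n^{1-\alpha}
- \alpha\left(\dfrac{\lambda L (1 - \alpha)}{n^{1-\alpha} - \alpha}\right)^{1/\alpha - 1}} .$$

Again, plots of $R$ as a function of $\lambda$ for various values of $\alpha$ appear in Figure 2.

Fig. 2. The saving factor (relative to unicast) of the bandwidth (load) of a server
for multicast with Zipf-like distribution for various values of the parameter $\alpha$ as
a function of the number of hits per second. The number of pages is 10,000 and
the latency is 25 seconds.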






Yossi Azar et al. 



Asymptotic Expression - $\alpha > 1$. It is interesting to consider the asymptotic
behavior of the saving factor for a site, as the number of pages grows. It is
not hard to show that the saving function is monotone non-increasing with the
number of pages. Moreover, for the case $\alpha > 1$, it turns out that the saving
factor approaches a limit which is bounded away from 1. Hence, to bound
the saving factor for any number of pages we can assume that the number of
pages $n$ approaches infinity. The saving factor $R$ in the asymptotic case, which
as will be seen has a simpler expression (independent of $n$), is a lower bound on
the saving factor for any $n$ (i.e. we save at least that much). This is very useful
since the number of pages in a site is usually large and continuously growing.

For evaluating the asymptotic behavior we approximate the expression for $R$
by replacing $n^{1-\alpha}$ with zero. Then for the range $\frac{\alpha}{\alpha - 1} \le \lambda L$ we have

$$B_m = \frac{E(S)\lambda}{-\alpha}\left(-\alpha\left(\frac{\lambda L (1 - \alpha)}{-\alpha}\right)^{1/\alpha - 1}\right)
= E(S)\lambda\,\bigl(\lambda L (1 - 1/\alpha)\bigr)^{1/\alpha - 1} .$$

Hence the saving factor relative to unicast is

$$R = \bigl(\lambda L (1 - 1/\alpha)\bigr)^{1 - 1/\alpha}$$

and it is independent of $n$.

The saving factor of the total bandwidth for a site (including both unicast
pages and multicast pages) yielded by multicasting the relevant pages can be
found in Figure 3 for $\alpha = 1.4$, $\alpha = 1.6$ and $\alpha = 1.8$, for a few examples.



$\lambda$    $L$    saving, $\alpha = 1.4$    saving, $\alpha = 1.6$    saving, $\alpha = 1.8$
200          20     7.48                      15.25                     27.82
200          4      4.72                      8.49                      13.60
20           300    8.39                      18.07                     33.31

Fig. 3. Examples of the Saving Factor for $\alpha = 1.4$, $\alpha = 1.6$ and $\alpha = 1.8$.
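The finite-$n$ saving factor for $\alpha \ne 1$ and its asymptotic form for $\alpha > 1$ can be compared numerically with the sketch below. The function names are hypothetical, and the finite-$n$ formula is the one obtained from $C = (n^{1-\alpha} - \alpha)/(1 - \alpha)$ under the integral approximation.

```python
def saving_factor(lam, L, n, alpha):
    """Finite-n saving factor R for alpha != 1 in the typical range C <= lam*L < C*n^alpha."""
    a = n ** (1 - alpha) - alpha
    c = a / (1 - alpha)                       # C(alpha), integral approximation
    x = lam * L
    if not (c <= x < c * n ** alpha):
        raise ValueError("lam * L is outside the typical range")
    inner = (x * (1 - alpha) / a) ** (1 / alpha - 1)
    return a / (n ** (1 - alpha) - alpha * inner)


def asymptotic_saving_factor(lam, L, alpha):
    """Limit of R as n grows, for alpha > 1: (lam*L*(1 - 1/alpha)) ** (1 - 1/alpha)."""
    if alpha <= 1:
        raise ValueError("the n-independent limit only applies for alpha > 1")
    x = lam * L
    if x < alpha / (alpha - 1):
        raise ValueError("lam * L is below the typical range")
    return (x * (1 - 1 / alpha)) ** (1 - 1 / alpha)


r_limit = asymptotic_saving_factor(200, 20, 1.4)
r_finite = saving_factor(200, 20, 10 ** 4, 1.4)
print(round(r_limit, 2), r_finite >= r_limit)
```

For $\lambda = 200$, $L = 20$ and $\alpha = 1.4$ the asymptotic value is about 7.48, and the finite-$n$ value for $n = 10^4$ lies above it, in line with the lower-bound property stated in the text.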



Asymptotic Expression - $\alpha < 1$. Now assume that $\alpha < 1$. For the asymptotic
behavior we can approximate the expression by assuming that $n^{1-\alpha}$ is relatively
large compared to $\alpha$ (i.e. $n$ is relatively large). Then for the approximate range

$$\frac{n^{1-\alpha}}{1 - \alpha} \le \lambda L < \frac{n}{1 - \alpha}$$

we have

$$B_m \approx E(S)\lambda\left(1 - \alpha\left(\frac{\lambda L (1 - \alpha)}{n}\right)^{1/\alpha - 1}\right) .$$

Hence the saving factor is

$$R = \frac{1}{1 - \alpha\,\bigl(\lambda L (1 - \alpha)/n\bigr)^{1/\alpha - 1}}$$

relative to unicast. This expression depends on $n$ (as $n$ goes to infinity, the saving
factor goes to 1, i.e., no saving) but it is a simpler expression than above.
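The step from the finite-$n$ expression to this approximation can be spelled out as follows. This is a sketch of the algebra only, assuming the finite-$n$ form of $B_m$ obtained with $C = (n^{1-\alpha} - \alpha)/(1 - \alpha)$ and replacing $n^{1-\alpha} - \alpha$ by $n^{1-\alpha}$.

```latex
\begin{align*}
B_m &= \frac{E(S)\lambda}{n^{1-\alpha}-\alpha}
       \left(n^{1-\alpha} - \alpha\left(\frac{\lambda L(1-\alpha)}{n^{1-\alpha}-\alpha}\right)^{1/\alpha-1}\right) \\
    &\approx E(S)\lambda\left(1 - \alpha\,\bigl(\lambda L(1-\alpha)\bigr)^{1/\alpha-1}\,
       n^{-(1-\alpha)(1/\alpha-1)}\,n^{-(1-\alpha)}\right) \\
    &= E(S)\lambda\left(1 - \alpha\left(\frac{\lambda L(1-\alpha)}{n}\right)^{1/\alpha-1}\right),
\end{align*}
% since (1-\alpha)(1/\alpha - 1) + (1-\alpha) = (1-\alpha)/\alpha = 1/\alpha - 1.
```

Dividing $B_u = \lambda E(S)$ by this expression gives the saving factor stated above.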



3.2 Multicast All Pages 

Here we multicast all pages, i.e., $k = n$, which corresponds to the range
$\lambda L \ge C(\alpha) n^\alpha$. We have $B_m = \sum_{i=1}^{n} S_i / L = E(S)\,n / L$.
If we compare it to unicast, we get that the saving factor is

$$R = \frac{\lambda L}{n} .$$

It is worthwhile to note that the above saving factor holds for all values of
$\alpha$. The range for achieving this saving factor is $\lambda L \ge n\ln(en)$ for $\alpha = 1$ and
$\lambda L \ge \frac{n^\alpha (n^{1-\alpha} - \alpha)}{1 - \alpha}$ for $\alpha \ne 1$. The range for
$\alpha \ne 1$ can be approximated by the range $\lambda L \ge \frac{\alpha n^\alpha}{\alpha - 1}$ for
$\alpha > 1$ and $\lambda L \ge \frac{n}{1 - \alpha}$ for $\alpha < 1$.

It is also worthwhile to mention that the case $\alpha = 0$ (i.e. uniform distribution)
always falls into either the extreme case or the low-traffic case. That is, if
$\lambda L \ge n$ it is worthwhile to multicast all pages, and otherwise it is not
worthwhile to multicast any page.



3.3 Properties of the Saving Function 

We list the following useful observations: 

- The saving function is continuous monotone non-decreasing as a function of
$\lambda L$ for any given $\alpha$ and $n$ in the admissible range. This can be easily proved
by considering the saving function directly.

- The saving function is continuous monotone non-increasing as a function of
$n$ for any given $\alpha$ and $\lambda L$ in the admissible range. This can be easily proved
for $\alpha = 1$. For $\alpha \ne 1$ this can be proved by showing that the saving function
is monotone in $n^{1-\alpha} - \alpha$, which is monotone in $n$.

- The saving function seems to be continuous monotone non-decreasing as a
function of $\alpha$ (also at $\alpha = 1$) for any given $n$ and $\lambda L$ in the admissible range.
4 A Site with Various Groups

In this section we assume that not all pages have a similar size and latency. We
partition the files into $r$ groups, where in each group the files are of
approximately the same size and latency. For group $j$ we denote by $f^u_j(E_j(S), \lambda_j)$ the
average bandwidth required for group $j$ using the standard unicast serving and
by $f^m_j(E_j(S), \lambda_j, L_j)$ the average bandwidth required for group $j$ using
multicast. Recall that we do not limit the number of pages that we multicast and
hence the decision whether to multicast a page does not conflict with the decisions
to multicast other pages. Hence the overall bandwidth is the superposition of the
bandwidths of the individual groups. Thus, the total bandwidth
used in unicast is

$$\sum_{j=1}^{r} f^u_j(E_j(S), \lambda_j)$$

where $f^u_j(E_j(S), \lambda_j) = \lambda_j E_j(S)$. The total bandwidth for multicast serving is

$$\sum_{j=1}^{r} f^m_j(E_j(S), \lambda_j, L_j)$$

where for group $j$ of the extreme case

$$f^m_j(E_j(S), \lambda_j, L_j) = E_j(S)\,n_j / L_j ,$$

for group $j$ of the typical case with $\alpha = 1$

$$f^m_j(E_j(S), \lambda_j, L_j) = E_j(S)\lambda_j\left(1 - \frac{\ln\frac{\lambda_j L_j}{\ln(e n_j)}}{\ln(e n_j)}\right) ,$$

and for $\alpha \ne 1$

$$f^m_j(E_j(S), \lambda_j, L_j) = \frac{E_j(S)\lambda_j}{n_j^{1-\alpha_j} - \alpha_j}
\left(n_j^{1-\alpha_j} - \alpha_j\left(\frac{\lambda_j L_j (1 - \alpha_j)}{n_j^{1-\alpha_j} - \alpha_j}\right)^{1/\alpha_j - 1}\right) .$$

5 Summary 

Our main contribution in this paper is the analysis of the saving factor that
can be achieved by using multicast instead of unicast in serving a typical
site. The analysis assumes a Zipf-like distribution for the access pattern of
the pages in the site. We note that for the most interesting case, where the
parameter $\alpha$ of the Zipf-like distribution is larger than 1, the saving
factor is almost independent of the number of pages (i.e., the site may contain
a huge number of pages). We also note that a crucial parameter in determining
the saving factor is the product of $\lambda$ and $L$, which are the access rate
for a group and the maximum latency with which we are allowed to deliver the
files. We have also designed a simple criterion for a given site to decide in
advance (or dynamically, while collecting information on the access pattern for
the site) which pages to multicast and which pages to continue to transmit with
the standard unicast.

We note that the saving factor can be further improved if we consider the peak
behavior, and not the average behavior, of the requests. In this case the
requirement for unicast bandwidth grows, while the requirement for multicast
is stable. We can change the criterion for which pages to multicast somewhat:
instead of comparing the average required rate for sending a page in unicast to
its multicast bandwidth, we compare the instantaneous demand. The exact analysis
in this case requires assumptions regarding the stochastic access pattern. Recent
studies show that requests do not arrive as, say, a Poisson process, but have
a self-similar heavy-tailed distribution (see, e.g., [?]). Thus, this analysis
can be complicated. Still, an approximation for the true saving can be obtained
by using the results derived here and choosing a higher value for $\lambda$ that
reflects the peak demand instead of the average access rate.
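The per-page comparison described above can be sketched as follows. This is a minimal illustration, assuming that multicasting one page costs its size divided by the maximum latency (the per-page analogue of the extreme-case group formula); the function and parameter names are hypothetical.

```python
def unicast_rate(request_rate, page_size):
    """Average bandwidth to serve one page by unicast (bytes per second)."""
    return request_rate * page_size


def multicast_rate(page_size, max_latency):
    """Bandwidth to multicast one page once every max_latency seconds."""
    return page_size / max_latency


def should_multicast(request_rate, page_size, max_latency):
    """True when unicast would need more bandwidth than multicast for this page."""
    return unicast_rate(request_rate, page_size) > multicast_rate(page_size, max_latency)
```

Passing a peak request rate instead of the average one gives the more aggressive variant of the criterion: a page that is not worth multicasting on average may still qualify during a burst.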
Abstract. Existing approaches for multirate multicast congestion con-
trol are either friendly to TCP only over large time scales or introduce
unfortunate side effects, such as significant control traffic, wasted band- 
width, or the need for modifications to existing routers. We advocate a 
layered multicast approach in which steady-state receiver reception rates 
emulate the classical TCP sawtooth derived from additive-increase, mul- 
tiplicative decrease (AIMD) principles. Our approach introduces the con- 
cept of dynamic stair layers to simulate various rates of additive increase 
for receivers with heterogeneous round-trip times (RTTs), facilitated by 
a minimal amount of IGMP control traffic. We employ a mix of cumu-
lative and non-cumulative layering to minimize the amount of excess
bandwidth consumed by receivers operating asynchronously behind a 
shared bottleneck. We integrate these techniques together into a conges- 
tion control scheme called STAIR which is amenable to those multicast 
applications which can make effective use of arbitrary and time-varying 
subscription levels. 



1 Introduction 

IP Multicast will ultimately facilitate both delivery of real-time multimedia 
streams and reliable delivery of rich content to very large audience sizes. One of 
the most significant remaining impediments to widespread multicast deployment 
is the issue of congestion control. Internet service providers and backbone service 
providers need assurances that multicast traffic will not overwhelm their infras- 
tructure. Conversely, content providers in the business of delivering content via 
multicast do not want artificial handicaps imposed by overly conservative multi- 
cast congestion control mechanisms. Resolution of the tensions imposed by this 
fundamental problem in networking motivates careful optimization of multicast 
congestion control algorithms and paradigms. 

While TCP-friendly multicast congestion control schemes which transmit at 
a single rate now exist, these techniques cannot scale to large audience sizes. The 
apparent alternative is multirate congestion control, whereby different receivers 
in the same session can receive content at different transfer rates. Several schemes 
for multirate congestion control using layered multicast [?] have been
proposed. Also, an excellent survey of related work on multicast congestion con-
trol appears in [?]. A layered multicast approach employs multiple multicast
groups transmitting at different rates to accommodate a large, heterogeneous 
population of receivers. In these protocols, receivers adapt their reception rate 
by subscribing to and unsubscribing from additional groups (or layers), typi-
cally leveraging the Internet Group Management Protocol (IGMP) as the control
mechanism. Also, these schemes tend to employ cumulative layering, which man- 
dates that each receiver always subscribe to a set of layers in sequential order. 
Cumulative layering dovetails well with many applications, such as those which
employ layered video codecs [?] for video transmission and methods for reliable
multicast transport which are tolerant to frequent subscription changes [?].

In conventional layering schemes, the rates for layers are exponentially
distributed: the base layer's transmission rate is $B_0$, and all other layers $i$
transmit at rate $B_0 \cdot 2^{i-1}$. Therefore, subscription to an additional layer
doubles the receiver's reception rate. The granularity of reception rate increase
in those schemes is unlike TCP's fine-grained additive-increase, multiplicative
decrease (AIMD). Because of this coarse granularity, rate increases are
necessarily abrupt, which runs the risk of buffer overflow; therefore, receivers
must carefully infer the available bandwidth before subscribing to additional layers.
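A small sketch of this doubling behavior, assuming a normalized base rate of 1 (the function names are illustrative only):

```python
def layer_rate(base_rate, i):
    """Rate of layer i: B_0 for the base layer, B_0 * 2**(i - 1) otherwise."""
    return base_rate if i == 0 else base_rate * 2 ** (i - 1)


def reception_rate(base_rate, top_layer):
    """Total rate of a receiver subscribed to layers 0..top_layer."""
    return sum(layer_rate(base_rate, i) for i in range(top_layer + 1))


# Cumulative reception rates as layers 0..4 are added one at a time.
rates = [reception_rate(1, top) for top in range(5)]
```

Each additional layer doubles the total, which is the coarse granularity that the fine-grained schemes discussed next try to avoid.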

A different approach advocating fine-grained multicast congestion control to 
simulate AIMD was proposed in [?]. We refer to this approach as FGLM (Fine-
Grained Layered Multicast). FGLM relies on non-cumulative layering and careful
organization of layer rates to enable a receiver to increase the reception rate at
the granularity of the base layer bandwidth $B_0$. Unlike earlier schemes, in this
scheme, all receivers act autonomously with no implicit or explicit coordination 
between them. One substantial drawback of this approach is a constant hum 
of IGMP traffic at each last hop router (1 join and 2 leaves per client at every 
additive increase decision point). This volume of control traffic is especially prob- 
lematic for last hop routers with large fanout to one multicast session, or those 
serving multiple sessions. Another drawback is that this approach incurs some 
bandwidth dilation at links, wasted bandwidth introduced by the uncoordinated 
activities of the set of downstream receivers. Finally, the use of non-cumulative 
layers is only amenable to applications which can make use of an arbitrary (and 
frequently changing) subset of subscription layers over time. The most natural 
applications of which we are aware are those in which any packet on any layer is 
equivalently useful to every receiver; such a situation arises in the digital foun-
tain approach defined in [?], which facilitates reliable multicast by transmitting
content encoded with fast forward error correcting codes. 

Our work presents a better method for simulating true AIMD multicast 
congestion control. At a high level, our STAIR (Simulate TCP's Additive In-
crease/multiplicative decrease with Rate-based) multicast congestion control al-
gorithm enables reception rates at receivers to follow the familiar sawtooth pat-
tern which arises when using TCP's AIMD congestion control. We facilitate this
by providing two key contributions. First, we define a stair layer, a layer whose 
rate dynamically ramps up over time from a base rate of one packet per RTT 
up to a maximum rate before dropping back to the base rate. The primary ben- 
efit of this component is to facilitate additive increase automatically, without 
the need for IGMP control messages. Second, we provide an efficient hybrid ap-
proach to combine the benefits of cumulative and non-cumulative layering below
the stair layer. This hybrid approach provides the flexibility of non-cumulative 
layering, while mitigating several of the performance drawbacks associated with 
pure non-cumulative layering. While our STAIR approach appears complex, 1) 
the algorithm is straightforward to implement and easy to tune, 2) it delivers 
data to each receiver at a rate that is in very close correspondence to the be- 
havior of a unicast TCP connection over the same path, and 3) it does so with 
a quantifiable and reasonable bandwidth cost. 



2 Definitions and Building Blocks for Our Approach 

In order to motivate our new contributions, we begin with techniques from previ-
ous work which relate closely to our approach. In [?], four metrics for evaluating
a layered multicast congestion control scheme are provided, two of which we 
recapitulate here. 

Definition 1. The join complexity of an operation (such as additive increase)
under a given layering scheme is the number of multicast join messages a receiver 
must issue in order to perform that operation in the worst case. 



Definition 2. For a layering scheme which supports reception rates in the range
$[1, R]$, and for a given link $\ell$ in a multicast tree, let $M_\ell \le R$ be the maximum
reception rate of the set of receivers downstream of $\ell$ and let $C_\ell$ be the bandwidth
demanded in aggregate by receivers downstream of $\ell$. The dilation of link $\ell$ is
then defined to be $C_\ell / M_\ell$. Similarly, the dilation imposed by a multicast session
on tree $T$ is taken to be $\max_{\ell \in T} (C_\ell / M_\ell)$.
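Definition 2 can be made concrete with a short sketch. It assumes that the aggregate demand $C_\ell$ on a link is the total rate of the union of layers subscribed to by the receivers downstream of that link, since each multicast layer crosses a link at most once; the function names and the sample layer rates are illustrative.

```python
def link_dilation(layer_rates, subscriptions):
    """Dilation C/M of one link.

    layer_rates: rate of each layer, indexed by layer number.
    subscriptions: one set of layer indices per downstream receiver (non-empty).
    """
    carried = set().union(*subscriptions)            # layers crossing the link
    demanded = sum(layer_rates[i] for i in carried)  # C: aggregate demand
    best = max(sum(layer_rates[i] for i in s) for s in subscriptions)  # M
    return demanded / best


def session_dilation(layer_rates, links):
    """Maximum link dilation over all links of the tree."""
    return max(link_dilation(layer_rates, subs) for subs in links)
```

With cumulative layers the receivers' subscriptions are nested, so the union equals the largest subscription and the dilation is 1; with non-cumulative layers the union can be strictly larger than any single receiver's rate.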

Table 1 compares the performance of various layering schemes which attempt 
to perform AIMD congestion control. Briefly, one cannot perform additive in- 
crease in a standard cumulative protocol, and while non-cumulative schemes 
can do so, they do so only with substantial control traffic and/or bandwidth 
dilation per operation. 



Table 1. Performance of AIMD Congestion Control for Various Approaches.

Sequence          Dilation   Complexity of AI    Complexity of MD
Ideal             1          zero                zero
Std. Cum^1        1          N/A                 1 leave
Std. NonCum^1     2          O(log R)            O(log R)
FGLM [?]          1.6        1 join, 2 leaves    1 leave

^1 Standard refers to the doubling scheme described in the introduction.
We briefly sketch one non-cumulative layering scheme used in [?]. The layering
scheme is defined by $B_0 = 1$, $B_1 = 2$, and $B_i = B_{i-1} + B_{i-2} + 1$ for $i \ge 2$. The
first few rates of the layers for this scheme are 1, 2, 4, 7, 12, 20, 33, ..., where
the base rate can be normalized arbitrarily. Increasing the reception rate by one
unit can be achieved by the following procedure: choose the smallest layer $i \ge 0$
to which the receiver is not currently subscribed; then subscribe to layer $i$ and
unsubscribe from layers $i - 1$ and $i - 2$. A receiver can approximately halve its
reception rate by unsubscribing from its highest subscription layer. While this
does not exactly halve the rate, the decrease is bounded by a factor which lies
in the interval from approximately 0.4 to 0.6.
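The increase procedure can be checked with a short sketch. It generates the Fibonacci-like layer rates and applies the "subscribe to the smallest missing layer, drop the two below it" rule; the function names are illustrative, and only the additive-increase step is modeled.

```python
def fglm_layer_rates(count):
    """First `count` layer rates: 1, 2, then B_i = B_{i-1} + B_{i-2} + 1."""
    rates = [1, 2]
    while len(rates) < count:
        rates.append(rates[-1] + rates[-2] + 1)
    return rates[:count]


def additive_increase(subscribed):
    """Join the smallest unsubscribed layer i and leave layers i-1 and i-2."""
    i = 0
    while i in subscribed:
        i += 1
    updated = set(subscribed)
    updated.add(i)           # one join
    updated.discard(i - 1)   # up to two leaves (no-ops when i < 2)
    updated.discard(i - 2)
    return updated


def reception_rate(subscribed, rates):
    """Total rate over the subscribed layers."""
    return sum(rates[i] for i in subscribed)
```

Starting from an empty subscription and applying `additive_increase` repeatedly walks the reception rate through 1, 2, 3, ... in unit steps, because $B_i - B_{i-1} - B_{i-2} = 1$ for every $i \ge 2$.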

One salient issue with FGLM is that the base layer bandwidth $B_0$ is fixed
once for all receivers. Setting $B_0$ to a small value mandates frequent subscription
changes (via IGMP control messages) for receivers with small RTTs. Setting it 
to be large causes the problems of abrupt rate increases and buffer overruns that 
FGLM is designed to avoid. 

3 Components of Our Approach 

In this section, we describe our two main technical contributions. The first con- 
tribution is a method for minimizing the performance penalty associated with 
non-cumulative layering by employing a hybrid strategy which involves both cu- 
mulative and non-cumulative layers. Our approach retains all of the benefits of 
fine-grained multicast advocated in [?], with the added benefit that the dila-
tion can be reduced from 1.62 down to $1 + \epsilon$ with only a small increase in the
number of multicast groups. The second contribution introduces new, dynamic 
stair layers, which facilitate fine-grained additive increase without requiring a 
substantial number of IGMP control messages. Taken together, these features 
make the fine-grained layered multicast approach much more practical. 

3.1 Combining Cumulative and Non-cumulative Layering 

In a conventional cumulative organization of multicast layers, only cumulative
layers are used to achieve rates in the normalized range $[1, R]$.

— Cumulative Layers (CL): The base layer rate is $c_0 = 1$, and for all other layers
$C_i$, $1 \le i \le \log_\alpha R$, the rate $c_i = c_0 (\alpha - 1) \alpha^{i-1}$. When $\alpha = 2$, this
corresponds to doubling of rates as each layer is added.^2

In the fine-grained multicast scheme of [?], only non-cumulative layers are used
to achieve a spectrum of rates over the same normalized range.

— Non-Cumulative Layers (NCL): The non-cumulative layering scheme Fib1
presented in [?] has layers $N_i$ whose rates are specified by the Fibonacci-like
sequence $n_0 = 1$, $n_1 = 2$, and $n_i = n_{i-1} + n_{i-2} + 1$ for $i \ge 2$.

Note that both CLs and NCLs are static layers for which the transmission rate
of the layer is fixed for the duration of the session.

^2 Bandwidths can be scaled up multiplicatively by a base layer bandwidth $B_0$ in these
schemes.
Fig. 1. (Left) Hybrid Layer Scheme: $K = \alpha^j + r$; $K = 2^j + 5$ when $\alpha = 2$. CL
denotes Cumulative Layer, NCL denotes Non-Cumulative Layer.
(Right) Maximal dilation at a link as a function of available link bandwidth.



In the hybrid scheme which we propose, we will require that both a set of
cumulative layers $C_i$ and a set of non-cumulative layers $N_i$ are available for
subscription. To attain a given subscription rate $K$, a receiver will subscribe to
a set of cumulative layers to attain a rate that is the next lowest power of
$\alpha$, capped by a set of non-cumulative layers to achieve a rate of exactly $K$,
as depicted in Figure 1 (left). In particular, we let $j = \lfloor \log_\alpha K \rfloor$ and
write $K = \alpha^j + r$, then subscribe to layers $C_0, \ldots, C_j$ as well as the set of
non-cumulative layers $\{N_r\}$ that the FGLM scheme would employ to attain a rate
of $r$. As prescribed by FGLM, fine-grained increase (adding $c_0$) requires one join
and two leaves, except for the relatively infrequent case when we move to a rate
that is an exact power of $\alpha$. In this case, we unsubscribe from all
non-cumulative layers and subscribe to one additional cumulative layer.
Multiplicative decrease now requires one leave from a cumulative layer and one
leave from a non-cumulative layer. Comparing against a standard non-cumulative
scheme, which used $\log_{1.6} R$ layers, we have now added $\log_\alpha R$ cumulative
layers, or a constant factor increase. What we have gained is a dramatic
improvement in dilation, expressed as the following lemma.

Lemma 1. The dilation of the hybrid scheme is $1 + 1.62\,\frac{\alpha - 1}{\alpha}$.

Proof. We proceed by proving an upper bound on the dilation of an arbitrary
link $\ell$, which gives a corresponding bound on the dilation of the session. For each
user $U_j$ downstream of $\ell$, consider the rate it obtains over cumulative layers $a_j$
and the rate it obtains over non-cumulative layers $b_j$ separately and denote its
total rate by $u_j$. Let the user with maximal total rate $a_j + b_j$ be denoted by
$U$, let its rates be denoted by $a$ and $b$ respectively, and let $u = a + b$. Now
reconsider user $U_j$. If $a_j < a$, then by the layering scheme employed,
$u_j = a_j + b_j < \alpha a_j$. Adding $\alpha b_j$ to both sides gives:

$$u_j + \alpha b_j < \alpha a_j + \alpha b_j = \alpha u_j \qquad (1)$$

Simplifying yields: $b_j < \frac{u_j (\alpha - 1)}{\alpha} \le \frac{u (\alpha - 1)}{\alpha}$.

Otherwise, if $a_j = a$, then by maximality $b_j \le b < (\alpha - 1) a$. In either case, $b_j$ is
less than $\frac{u (\alpha - 1)}{\alpha}$, so $\max_j b_j$ is as well. From the dilation lemma proved in [?],
a set of users subscribing to non-cumulative layers experiences a limiting worst-
case dilation of 1.62. Thus the total bandwidth consumed by non-cumulative
layers across $\ell$ is at most $1.62 \max_j b_j$. Plugging these derived quantities into the
formula in Definition 2 yields:

$$\text{Dilation} \le \frac{a + 1.62 \max_j b_j}{u} < \frac{a + 1.62\,\frac{(\alpha - 1) u}{\alpha}}{u} \le 1 + 1.62\,\frac{\alpha - 1}{\alpha} \qquad \Box$$



Applying this lemma to a hybrid scheme with a geometric increase rate of
$\alpha = 1.2$ on the cumulative layers realizes the benefits of a non-cumulative
scheme, reduces the worst-case dilation in the limit from 1.62 to 1.27 (a 22%
bandwidth savings), and requires only a modest increase in the number of groups.
Figure 1 (right) shows the maximal dilation at a link as the link bandwidth
varies as a function of $n_0$ for FGLM and the hybrid scheme for two different
values of $\alpha$. Recall that in FGLM, there are bandwidth transition points, when
clients will subscribe to a new maximum layer $j$ and unsubscribe from layers
$j - 1$ and $j - 2$ across the bottleneck. At these transition points (spikes in the
plot), worst-case dilation can be large due to the bandwidth consumed by this
new layer. While STAIR with $\alpha = 2$ has comparable dilation to FGLM, STAIR with
$\alpha = 1.2$ has substantially smaller worst-case dilation.
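The two quantities used in this subsection can be sketched numerically. The first helper splits a target rate $K$ into the cumulative part $\alpha^j$ and the non-cumulative remainder $r$; the second evaluates the bound $1 + 1.62(\alpha - 1)/\alpha$, which is the form of Lemma 1 that reproduces the 1.27 figure quoted for a geometric rate of 1.2. Function names are illustrative, and rates are assumed to be normalized so that the base rate is 1.

```python
def hybrid_split(rate, alpha=2):
    """Return (j, r) with rate = alpha**j + r and alpha**j the largest power <= rate."""
    if rate < 1:
        raise ValueError("rate must be at least 1 (the normalized base rate)")
    j, power = 0, 1
    while power * alpha <= rate:
        power *= alpha
        j += 1
    return j, rate - power


def hybrid_dilation_bound(alpha, ncl_dilation=1.62):
    """Upper bound on hybrid dilation for geometric rate alpha."""
    return 1 + ncl_dilation * (alpha - 1) / alpha
```

For example, `hybrid_split(13)` yields `(3, 5)`: cumulative layers supply a rate of 8, and non-cumulative layers supply the remaining 5 units. A smaller geometric rate tightens the bound at the cost of more cumulative layers.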



3.2 Introducing Stair Layers 

Our next contribution is stair layers, so named because the rates on these layers 
change dynamically over time, and in so doing resemble a staircase. This third 
layer that a sender maintains is used to automatically emulate the additive- 
increase portion of AIMD congestion control, without the need for IGMP con- 
trol traffic. Different stair layers are used to accommodate additive increase for 
receivers with heterogeneous RTTs from the source. These layers also “smooth”
discontinuities between subscription levels of the underlying CLs and NCLs,
which provide rather coarse granularity (in the subsequent discussion, we assume
that these underlying layers have base rates $c_0 = n_0 = 1$ Mbps). Finally,
we note that the addition of stair layers increases the dilation beyond that proven 
in Lemma 1, but only by a small additive term, which we quantify in the full 
version of the paper. 

— Stair Layers (SL): Every SL has two parameters: 1) a round-trip time t in ms 
that it is designed to emulate and 2) a maximum rate R, measured in packets 
per t ms. The rate transmitted on each SL is a cyclic step function with a 
minimum bandwidth of 1 packet per t ms, a maximum of R, a step size of 
one packet, and a stepping rate of one per emulated RTT. Upon reaching 
the maximum attainable rate, the SL recycles to a rate of one packet per 
RTT. 

Unlike CLs and NCLs, SLs are dynamic layers whose rates change over time.
Dynamic layers were first used by [?] to probe for available bandwidth and
Fig. 2. Use of a Stair Layer with t = 128ms, R = 1Mbps, packet size S = 1KB.
(Left) Rate of $SL_{128}$ in isolation. (Right) $SL_{128}$ used in conjunction with
underlying non-cumulative layers.



later defined as such and used in [?] to avoid large IGMP leave latencies. Figure
2 (left) shows the transmission pattern of $SL_{128}$ (a stair layer for 128ms RTT)
with maximum rate $R$ = 16 packets per RTT. Also depicted in Figure 2 is a
third useful parameter of a stair layer:

Definition 3. The stair period of a given stair layer is the duration of time 
that it takes the layer to iterate through one full cycle of rates. 

Given a stair layer with an emulated RTT $t$ and a maximum rate $R$ (in packets
per RTT), the stair period $p$ satisfies $p = Rt$. Typically, we will set the maximum
rate $R$ of a stair layer to be the base rate of the standard cumulative scheme $c_0$
(in Mbps), in which case we substitute for $R$ and perform the appropriate
conversions, assuming a fixed packet size $S$ in bytes: $p = \frac{c_0}{8S}\, t^2$, with $c_0$ in
bits per second and $t$ in seconds.

In practice, the sender will maintain several SLs to emulate a range of
different RTTs. However, the fixed packet size and the maximum rate $R$ of a stair
layer give a lower bound on the range of RTTs that can be accommodated.
The height of the staircase in steps directly corresponds to the factor in control
traffic savings that will be achieved. Denoting this minimum desired height by
$h$, we require that $t \ge 8hS/R$ (with $S$ in bytes and $R$ in bits per second). For
example, with a packet size of 512 bytes, $R$ = 1 Mbps, and a desired value of
$h = 8$, the smallest allowable RTT in an SL is approximately 32ms.
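The stair-layer arithmetic can be sketched as follows. This is an illustration only: it assumes decimal units (1 Mbps = 10^6 bits per second, 1 KB = 1000 bytes), and the function names are hypothetical.

```python
def packets_per_rtt(rate_bps, rtt_s, packet_bytes):
    """Number of packets of the given size that rate_bps carries in one RTT."""
    return rate_bps * rtt_s / (8 * packet_bytes)


def stair_rate(elapsed_ms, rtt_ms, max_packets):
    """Packets per RTT sent by the stair layer at a given time since cycle start.

    The rate starts at 1, grows by 1 every emulated RTT, and recycles to 1
    after reaching max_packets.
    """
    step = int(elapsed_ms // rtt_ms) % max_packets
    return step + 1


def stair_period_ms(rtt_ms, max_packets):
    """Duration of one full cycle of rates: max_packets steps of one RTT each."""
    return rtt_ms * max_packets


def min_rtt_s(height, packet_bytes, rate_bps):
    """Smallest emulated RTT that still yields a staircase of `height` steps."""
    return 8 * height * packet_bytes / rate_bps
```

With the parameters of Figure 2 (128 ms RTT, 1 Mbps, 1 KB packets) the staircase has 16 steps and a period of 2048 ms. For 512-byte packets at 1 Mbps and a height of 8, `min_rtt_s` gives about 32.8 ms, in line with the roughly 32 ms quoted in the text.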

4 The STAIR Congestion Control Algorithm 

We now describe how the techniques we have described come together into a 
unified multirate congestion control algorithm. We employ a hybrid scheme as 
described in Section 3.1, from which each receiver selects an appropriate subset
of layers, used in concert with one stair layer, appropriate for its RTT. The two 
most significant challenges to address are providing the algorithms for performing
additive increase and multiplicative decrease, respectively. Two additional
challenges we address are 1) incorporating methods for estimation of multicast 
RTTs and 2) establishing a set of appropriate stair layers. 



4.1 Additive Increase, Multiplicative Decrease 

In order for a set of stair layers to complement a set of CLs and NCLs, the
maximum rate of the stair layer must be calibrated to the base rate of the CLs
and NCLs. The effect of appropriately calibrated rates can be seen in Figure 2
(right): at exactly those instants when the stair layer recycles, the
subscription rate on the NCLs increases by $n_0$ to compensate for the identical
decrease on the stair layer. Now in order to conduct AIMD congestion control,
the receiver measures packet loss over each stair period (during which additive
increase takes place automatically). If there is no loss, then the receiver
performs an increase of $n_0$, the base bandwidth of the NCLs, as described earlier
(1 join and 2 leaves, or $k$ leaves when the new rate is an exact power of
$\alpha$). As an aside, we note that it may be much more efficient for a last-hop
router to handle such a batch of IGMP leave requests, rather than handling them
as $k$ separate requests.

Conversely, if there is a packet loss event in a stair period (of one or more 
losses), then one round of multiplicative decrease is performed. Approximately 
decreasing the rate by half is straightforward - it is necessary to drop the top 
cumulative layer as well as the top non-cumulative layer. While existing non- 
cumulative layering schemes do not easily admit dropping rates by exactly a 
factor of two, the consequences are mitigated substantially in a hybrid scheme; 
moreover, our experimental results indicate that the level of TCP-friendliness 
which can be achieved using our approach remains very high. We also note 
that there is no particular reason to wait until a stair period terminates before 
conducting multiplicative decrease - it can be done any time. Since the STAIR 
receiver unsubscribes and subscribes frequently to increase the rate, IGMP leave 
latency could be problematic. One solution is to perform joins and leaves in 
advance of when those operations need to take effect; we are also hopeful that 
subsequent versions of IGMP will accommodate fast IGMP leaves so that we 
can use them directly to respond to congestion in a timely fashion. 
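The per-stair-period decision can be sketched as a small state update. This is a simplified model, not the authors' implementation: it assumes a doubling cumulative scheme (geometric rate 2) with normalized base rates of 1, tracks only the index of the top cumulative layer and the set of non-cumulative layers, and ignores the stair layer itself and IGMP latency. All names are hypothetical.

```python
NCL_RATES = [1, 2, 4, 7, 12, 20, 33]  # Fibonacci-like non-cumulative layer rates


def ncl_increase(ncl):
    """Unit increase: join the smallest missing layer i, leave layers i-1 and i-2."""
    i = 0
    while i in ncl:
        i += 1
    return (ncl | {i}) - {i - 1, i - 2}


def rate(top_cl, ncl):
    """Total static rate: cumulative layers 0..top_cl plus the non-cumulative set."""
    return 2 ** top_cl + sum(NCL_RATES[i] for i in ncl)


def end_of_stair_period(top_cl, ncl, loss_seen):
    """Return the new (top_cl, ncl) state after one stair period."""
    if loss_seen:
        # Multiplicative decrease: leave the top cumulative layer and the
        # top non-cumulative layer (when one is subscribed).
        new_top = max(top_cl - 1, 0)
        new_ncl = ncl - {max(ncl)} if ncl else ncl
        return new_top, new_ncl
    # Additive increase by one base unit on the non-cumulative layers.
    ncl = ncl_increase(ncl)
    if rate(top_cl, ncl) == 2 ** (top_cl + 1):
        # Reached an exact power of 2: trade all NCLs for one more cumulative layer.
        return top_cl + 1, frozenset()
    return top_cl, ncl
```

Starting from `(3, frozenset())`, five loss-free periods bring the rate from 8 to 13; a loss at that point drops it to 5, which is only roughly half, as the text notes.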



4.2 Configuration of Stair Layers 



As motivated earlier, to accommodate a wide variety of receivers, stair layers
must be configured carefully. We choose to space the RTTs across the available
stair layers exponentially. Let the RTT of the base stair layer be $2^i$ ms. The
base stair layer increases its sending rate every $2^i$ ms, and every other stair
layer $j$ increases its sending rate every $2^{i+j}$ ms. The TCP throughput rate
$R$, in units of packets per second, can be approximated by the formula in [?]:

$$R = \frac{1}{RTT\, \sqrt{q}\, \left( \sqrt{2/3} + 6 \sqrt{3/2}\; q\, (1 + 32 q^2) \right)}$$

where $R$ is a function of the packet loss rate $q$, the TCP round trip time $RTT$,
and the round trip timeout value $RTO$, where $RTO = 4\,RTT$ according to [?].
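The throughput approximation can be evaluated directly. The sketch below implements the simplified form with the timeout set to four round trip times; the function name is illustrative.

```python
from math import sqrt


def tcp_throughput(rtt_s, loss_rate):
    """Approximate TCP throughput in packets per second (RTO taken as 4 * RTT)."""
    q = loss_rate
    return 1.0 / (
        rtt_s * sqrt(q) * (sqrt(2 / 3) + 6 * sqrt(3 / 2) * q * (1 + 32 * q * q))
    )
```

For a fixed loss rate the result scales as 1/RTT, so a receiver with twice the round trip time is expected to obtain half the throughput. This is the relationship the exponentially spaced stair layers are meant to track.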
Since the throughput is inversely proportional to RTT, the receiver with a
small RTT is more sensitive to the throughput than the receiver with large RTT, 
thus we recommend that RTTs provided by stair layers be exponentially spaced. 
Note that with an exponential spacing of stair layers, a receiver may subscribe 
to a different SL if its measured RTT changes significantly: it can subscribe
to a faster layer at the end of its current stair period, or drop down to a slower 
stair layer every other stair period. 



4.3 RTT Estimation and STAIR Subscription 

Each receiver must measure or estimate its RTT to subscribe to an appropriate 
stair layer. A variety of methods can be employed to do so; we describe two such 
possibilities, with the expectation that any scalable method can be employed 
in parallel with our approach. Golestani et al. provide an effective mechanism 
to measure RTT in multicast using a hierarchical approach [?]. However, their
approach requires clock synchronization among the sender and receivers and 
depends on some router support which is not widely available. Another simple 
way to estimate RTT is to use one of various ping-like utilities. However, one 
cost associated with use of ping is that as the number of receivers increases, the
sender faces a “ping implosion” problem. We leave efficient RTT estimation for 
future work, noting the need for careful study of the tradeoff between frequency 
of measurement and accuracy of estimation. 

Assuming that the receiver has an estimate of its RTT, its next challenge is
to subscribe to the appropriate stair layers. Let $RTT_i$ be the RTT of $SL_i$ and
$RTT_m$ be the measured RTT. The receiver can subscribe to appropriate stair
layers based on the measured RTT in the following way. If $RTT_m$ is within a
$(1 + \epsilon)$ factor of $RTT_i$ for some $i$, simply subscribe to $SL_i$. A reasonable
choice for $\epsilon$, which we argue for in the full paper, is $\epsilon = 1/3$. Otherwise,
to decrease the error bound in certain cases, a receiver should subscribe to the
two smaller stair layers $SL_i$ and $SL_{i-1}$ for which $RTT_i < RTT_m$.
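The single-layer case of this rule can be sketched as follows. It interprets "within a $(1 + \epsilon)$ factor" as the ratio between the measured RTT and the layer RTT being at most $1 + \epsilon$ in either direction, which is an assumption; the two-layer fallback is not modeled and is signalled by returning `None`. Names are hypothetical.

```python
def choose_stair_layer(measured_rtt_ms, layer_rtts_ms, eps=1 / 3):
    """Return the emulated RTT of the matching stair layer, or None if none is close."""

    def mismatch(layer_rtt):
        # Ratio >= 1 measuring how far the layer RTT is from the measured RTT.
        return max(measured_rtt_ms / layer_rtt, layer_rtt / measured_rtt_ms)

    best = min(layer_rtts_ms, key=mismatch)
    return best if mismatch(best) <= 1 + eps else None
```

With layers at 32, 64, 128, and 256 ms, a measured RTT of 60 ms maps to the 64 ms layer and 120 ms maps to the 128 ms layer, while 90 ms falls in the gap between layers and would need the two-layer rule.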



5 Experimental Evaluation 

We have tested the behavior of STAIR extensively using the NS simulator [?].
The simulation results show that STAIR exhibits good inter-path fairness when 
competing with TCP traffic in a wide variety of scenarios. Our initial topol- 
ogy is the “dumbbell”, with all non-bottleneck links set to 10ms delay and 100 
Mbps bandwidth. In this topology, we vary the cross-traffic multiplexing level 
by varying the number of TCP flows, vary the bottleneck bandwidth, and scale 
the queue size. We then consider the impact of richer topologies, including mul- 
tiple bottleneck links, and TCP cross traffic with both short and long RTTs. 
Throughout our experiments, we set $c_0 = n_0 = 512$ Kbps and set $\alpha = 2$, i.e.,
the rate $c_i = 2^{i-1} \cdot 512$ Kbps for $i > 0$. We employ a fixed packet size of 512B
throughout. Also, while there is theoretical justification for smaller settings of
$\alpha$, we did not observe worst-case dilation often in our simulations.
In most of the experiments we describe here, we use RED gateways, primarily as
a source of randomness to remove simulation artifacts such as phase effects that 
may not be present in the real world. Use of RED vs. drop-tail gateways does not 
appear to materially affect performance of our protocol. The RED gateways are 
set up in the following way: we set the queue size to twice the bandwidth-delay 
product of the link, set minthresh to 5% of the queue size and maxthresh to 50% 
of the queue size with the gentle setting turned on. Our TCP connections use 
the standard TCP Reno implementation provided with NS. 








Fig. 3. (a)(b): TCP Flows and One STAIR with RED, (c) DropTail. 



Figure 3 shows the throughput of a one-receiver STAIR flow competing with
three TCP Reno flows on RED. Figure 3(a) shows the throughput of STAIR
competing with three TCP flows across a dumbbell with a 12ms/20Mbps bottleneck
link. In this environment, STAIR fairly shares the link with TCP flows. We
next vary the bandwidth of the bottleneck link to assess the impact of changing
the range of subscribed NCLs. Figure 3(b) shows the average throughput trends
achieved by three long-running TCP flows and one STAIR flow on various
bottleneck bandwidths.

Figure 3(c) shows the throughput of STAIR competing with four TCP Reno
flows on drop-tail gateways competing for 30Mbps bottleneck bandwidth. Dy- 
namics across a bottleneck drop-tail gateway tended to be more dramatic, but 
overall fairness remained high. Here, the throughputs of TCP receivers ranged 
between [5.4Mbps, 6.5Mbps] with a mean per-connection throughput of 6.0 
Mbps, while the STAIR receiver had average throughput of 5.6Mbps. 

We used a second topology to test heterogeneous fairness (see Figure 4) under
different RTTs. We consider a single STAIR session with two STAIR receivers
and two parallel TCP flows sharing the same bottleneck link. The RTT of STAIR
receiver 1, $R_{S1}$, is 60ms, while the RTT of STAIR receiver 2, $R_{S2}$, is 120ms. In our
experimental setup, each receiver periodically samples the RTT using ping. $R_{S1}$
subscribes to $SL_{64}$, which is the closest stair layer based on the measured RTT,
while $R_{S2}$ subscribes to $SL_{128}$. The throughput of each of the flows is plotted in
Figure 4. Both of the STAIR flows share fairly with the parallel TCP flows with
the same RTT. Since the throughput of TCP is inversely proportional to RTT,
the receiver $R_{S2}$ should have approximately half of $R_{S1}$'s average throughput. In
Fig. 4. Throughput of STAIR and TCP flows sharing bottleneck link with
different RTT. $R_{S1}$, $R_{S2}$: STAIR receivers; $R_{T1}$, $R_{T2}$: TCP receivers.








Fig. 5. Dilation on the Bottleneck link (G0 - G1).



this experiment, the average throughput attained by $R_{S1}$ was 3.7Mbps and the
average throughput attained by $R_{S2}$ was 1.6Mbps.

We then increased the number of STAIR receivers using the topology in
Figure 5. The starting times of 30 STAIR receivers are uniformly distributed from
1 second to 10 seconds. The average throughputs attained were:
$R_{S1} \ldots R_{S10}$: 8.6 Mbps, $R_{T1}$: 7.3 Mbps, $R_{S11} \ldots R_{S20}$: 8.7 Mbps,
$R_{T2}$: 8.27 Mbps, $R_{S21} \ldots R_{S30}$: 4.5 Mbps, $R_{T3}$: 2.3 Mbps. The discrepancy
between receiver $R_{S30}$ and $R_{T3}$ points to an aspect of TCP behavior we have not
yet captured accurately. When a STAIR receiver subscribes to a new maximum layer
$j$ and unsubscribes from layers $j - 1$ and $j - 2$, it can cause a significant
increase in bandwidth consumption. Although the measured dilation on the link
(G3 - G5) is less than the dilation in Figure 5, the increases were substantial
enough to drive TCP flows into timeout. Since TCP timeout behavior is not yet
accurately reflected in STAIR, some unfairness can result.

We then considered varying the queue size of the bottleneck link router in 
our baseline topology, holding all other parameters constant. To make this sim- 
ulation more interesting, we used drop-tail gateways to magnify the negative 
Fig. 6. Throughput on Different Queue Size (left) and Different RTT (right),
with DropTail.



performance impact of large queues. Note that for these simulations, when the 
queue size is large (overprovisioned with respect to the bandwidth-delay product 
of the link), the RTT is affected by queuing delay. STAIR receivers adapt by 
changing the stair layer depending on the measured RTT. Figure 6 (left) shows the
throughput of STAIR and TCP as we vary the queue size. When the RTT varies 
over time, the throughput is affected by the error bound of RTT. Even though 
the average throughput is reduced as the queue size increases, STAIR is not 
especially sensitive to queue sizes, unlike some other schemes. Finally, we con-
sider varying the link delay on the bottleneck link. Figure 6 (right) shows that
as the estimated RTT increases, STAIR becomes less aggressive in accordance 
with TCP. Additional experiments which we conducted are available in the full 
version of this paper |2|. 

6 Conclusions 

We have presented STAIR: a hybrid of cumulative, non-cumulative and stair 
layers to facilitate receiver-driven multicast AIMD congestion control. Our ap- 
proach has the appealing scalability advantage that it allows receivers to operate 
asynchronously with no need for coordination; moreover, receivers with widely 
differing RTTs may simulate different TCP-friendly rates of additive increase. 
While asynchronous joining and leaving of groups at first appears to run the risk 
of consuming excessive bandwidth through a shared bottleneck, in fact judicious 
layering can limit the harmful impact of this issue. 

Our approach does have several limitations, which we plan to address in 
future work. First, while our congestion control scheme tolerates heterogeneous 
audiences, it is primarily designed for users with high end-to-end bandwidth rates 
in the hundreds of Kbps range or higher. We expect that slower users would wish 
to employ a different congestion control strategy than the one we advocate here. 
Second, congestion control approaches which use non-cumulative layering and 
dynamic layers cannot be considered general purpose (just as TCP’s congestion 
control mechanism is not general-purpose) since not all applications can take full 
advantage of highly layer-adaptive congestion control techniques. For now, the
only application which integrates cleanly with our congestion control methods 
is reliable multicast of encoded content. We hope to develop scalable STAIR 
methods compatible with other applications, such as real-time streaming. 



References 

1. J. Byers, M. Frumin, G. Horn, M. Luby, M. Mitzenmacher, A. Roetter, and 
W. Shaver. FLID-DL: Congestion Control for Layered Multicast. In Proceedings 
of NGC 2000, pages 71-81, November 2000. 

2. J. Byers and G. Kwon. STAIR: Practical AIMD Multirate Multicast Congestion 
Control, Technical Report BUCS-TR-2001-018, Boston University, Sept. 2001. 

3. J. Byers, M. Luby, and M. Mitzenmacher. Fine-Grained Layered Multicast. In 
Proc. of IEEE INFOCOM 2001, April 2001.

4. J. Byers, M. Luby, M. Mitzenmacher, and A. Rege. A Digital Fountain Approach 
to Reliable Distribution of Bulk Data. In Proc. of ACM SIGCOMM, pages 56-67, 
1998. 

5. S. Floyd, M. Handley, J. Padhye, and J. Widmer. Equation-based congestion
control for unicast applications. In Proc. of ACM SIGCOMM, 2000.

6. S. Golestani. Fundamental observations on multicast congestion control in the 
Internet. In Proc. of IEEE INFOCOM’99, New York, NY, March 1999.

7. A. Legout and E. Biersack. PLM: Fast convergence for cumulative layered multicast 
transmission schemes. In Proc. of ACM SIGMETRICS 2000, pages 13-22, Santa 
Clara, CA, 2000. 

8. S. McCanne, V. Jacobson, and M. Vetterli. Receiver-Driven Layered Multicast. In 
Proc. of ACM SIGCOMM’96, pages 1-14, August 1996. 

9. ns: UCB/LBNL/VINT Network Simulator (version 2). Available at 
http://www-mash.cs.berkeley.edu/ns/ns.html

10. J. Padhye, V. Firoiu, D. Towsley, and J. Kurose. Modeling TCP throughput: A 
simple model and its empirical validation. In Proc. of ACM SIGCOMM, 1998. 

11. L. Vicisano, L. Rizzo, and J. Crowcroft. TCP-like Congestion Control for Layered 
Multicast Data Transfer. In Proc. of IEEE INFOCOM’98, April 1998. 

12. J. Widmer, R. Denda, and M. Mauve. A Survey on TCP-Friendly Congestion 
Control. IEEE Network, May 2001. 

13. K. Yano and S. McCanne. The Breadcrumb Forwarding Service: A Synthesis of 
PGM and EXPRESS to Improve and Simplify Global IP Multicast. In Proc. of 
ACM SIGCOMM Computer Communication Review (CCR), 30 (2), April, 2000. 



Impact of Tree Structure on Retransmission 
Efficiency for TRACK 



Anthony Busson*, Jean Louis Rougier, and Daniel Kofman 



Ecole Nationale Superieure des Telecommunications 
46 rue Barrault, 75013 Paris, France 
{abusson, rougier, kofman}@enst.fr



Abstract. This paper focuses on tree based reliable multicast protocols, 
or more precisely TRACK (Tree based Acknowledgment) protocols, as 
defined by IETF. With the TRACK approach, the classical feedback 
implosion problem is handled by a set of servers, organized in a tree 
structure, which are in charge of local retransmissions and feedback ag- 
gregation. We study the impact of the control tree structure (for instance 
the number of servers to be deployed) on transmission performance.
We propose a new model, where point processes represent the receivers 
and the servers, which captures loss correlation phenomena. We are able 
to get explicit expressions of the number of useless retransmissions (use- 
less as the given segment was already received) as a function of a limited 
number of tree characteristics. Generic tree configuration rules, optimiz- 
ing transmission efficiency, are obtained. 



1 Introduction 

1.1 Reliable Multicast Issues

IP multicast leads to excellent network utilization when dealing with group
communication, as packets sent by the participants are duplicated by the routers
for the different receivers. Important advances have been made in multicast
routing protocols. Multicast applications using IP multicast are becoming widely
available, and are beginning to be commercialized. These group communication
applications need to transfer data with a certain degree of reliability. However,
no multicast transport layer has been standardized as yet. Intense research is
thus being undertaken to develop multicast transport protocols. The main problems
with reliable multicast protocols are due to scalability requirements, as
multicast sessions may contain a very large number of receivers (news networks,
sports events, etc.). The main factor affecting the scalability of reliable multicast
transport protocols is the so-called "feedback implosion" problem: if all
receivers acknowledge the received segments, the source would quickly become
flooded with such feedback information as the number of receivers grows.
On the other hand, the source actually needs to know the receivers' window state
* This work is supported by the French government funded project GEORGES
(http://www.telecom.gouv.fr/rnrt/projets/pgeorges.htm).

J. Crowcroft and M. Hofmann (Eds.): NGC 2001, LNCS 2233, pp. 113-127, 2001. 
in order to retransmit lost segments, and some receiver statistics (RTT, explicit
or implicit congestion indications, ...) for congestion control. Several approaches
have been designed, and to some extent tested in the Mbone, in order to solve 
this feedback implosion. 

— Negative ACKnowledgements can be used (as in the so-called NORM: Nack
Oriented Reliable Multicast [1]). Such approaches make use of
timers in order to avoid many NACKs being sent for the same data segment:
when a packet is detected missing, the receiver starts a backoff
timer; when the timer expires, if the receiver has received neither a
repair for the lost data nor a NACK from another receiver (in the
case where NACKs are multicast), the receiver sends its NACK to the source.
Such a "flat" approach is very interesting for its simplicity and its minimal
configuration and coordination requirements between members of a multicast session,
but suffers from scalability issues.

— Network nodes can be used in order to improve the scalability and efficiency
of reliable multicast protocols. For instance, in Pragmatic General Multicast
(PGM, [2]), a receiver which detects a missing packet multicasts a NACK.
A router which implements the PGM protocol (called a PGM router) immediately
multicasts a suppression message on the same port in order to avoid
multiple NACKs from receivers. The operation is repeated until the source
is reached; the recovery packet is then multicast only on ports which have
received a NACK. Generic Router Assistance (GRA, [3]) is being studied
and standardized at the IETF; however, such mechanisms may be difficult to
deploy, as existing routers do not support such services.

— Tree based protocols (TRACK: TRee-based ACKnowledgement [4], [5]) achieve
scalability by dividing the set of receivers into controlled subgroups,
a dedicated server being responsible for local retransmissions and feedback
aggregation for its subgroup. Further scalability can be obtained by organising
servers in subgroups and so on, generating a control tree with the source
(or a specific dedicated server) as root. TRACK architectures require more
configuration than the previous approaches (in order to set up the
control tree), but they appear to be the most scalable solution for
reliable multicast (for instance, some experiments on RMTP-II protocols
[6] and analytical studies [7] [8] have shown that trees are an answer to the
scalability problems). In this paper, we shall concentrate on dimensioning
rules for the tree configuration of such TRACK approaches.

— FEC (Forward Error Correction) can be used in order to limit receiver
acknowledgements (or even remove the need for feedback for "semi-reliable"
services). Note that FEC may be used in conjunction with the previous
approaches, particularly with TRACK and NORM. We shall not consider this
case in the present paper.

Another big issue regarding the development of reliable multicast protocols
is the design of congestion control algorithms. For instance, a main difficulty
here is the heterogeneity of receiver access bit rates and processing capacity. We
shall concentrate on the feedback implosion issue; congestion control is outside
the scope of this paper.

1.2 Contribution 

In this paper, we study the impact of the control tree structure on the
retransmission efficiency. In [9], we proposed a geometrical model using point processes
in the plane organized in clusters (representing the concentration of receiver
locations, such as in AS^1, POPs^2, etc.). The point processes allowed us to introduce
heterogeneity in the loss distribution among the set of receivers and thus to take
loss correlation into account. The model presented had strong constraints, as we
distributed servers (used in TRACK) within each concentration domain. The
choice of new point processes which describe the servers allows us in this paper
to remove these constraints. We define a new cost function representing the
number of packets received by participants that have already received
them (which we shall call "useless retransmissions"). We believe that this new cost
function is more realistic than the traditional number of retransmissions
(used in [9-11]). Stochastic geometry [12, 13] has already proved to
be well suited to the study of multicast routing protocols [14].

We find explicit formulae for the average cost for a generic distribution of 
receivers within a cluster, and compute the optimal parameters of the tree in 
terms of the number of children per parent. Finally, dimensioning rules are given 
with regard to the parameters describing the macroscopic behavior of the session, 
such as the loss probability inside and outside the concentration domains, the 
total number of receivers and the number of domains. 

The paper is organized as follows: in Section 2, we briefly describe the TRACK
protocols. The mathematical model and the cost function are explained in
Sections 3 and 4. Numerical results are given in Section 5. The conclusion is presented in
Section 6. For the sake of clarity, we give the computation details and explicit
formulae in the Appendix.



2 TRACK 

The main issue regarding the development of multicast transport protocols is
related to scalability. Indeed, for an ARQ based (Automatic Repeat reQuest)
protocol, each receiver sends its own reception state, either periodically by ACKs^3
or only when a packet is detected missing (NACK). In both cases, the source
can be flooded by this feedback information. The TRee-based ACK architecture
(TRACK) [4] [5] avoids this problem by constructing a hierarchy of servers which
aggregate feedback information of the receivers. Receivers do not send their state
to the source but to a server called a parent, which aggregates this information
and repairs any lost packets locally.

^1 Autonomous System.

^2 Point of Presence.

^3 A bit map of received packets is periodically sent to the source.
TRACK is designed to reliably send data from a single sender to a group of
receivers. At the beginning of the multicast session, each receiver binds to a nearby
Repair Head (RH). This Repair Head is defined as being responsible for local
error recovery and receiver state aggregation. Possible mechanisms that allow a
receiver to know the list of available RHs, and to build the resulting tree, are
given in [15] and [16]. Repair Heads of different receivers (children) are bound
to a RH of a superior level (parent). This binding is repeated until the source is
reached. By convention, the source is level 0 (and is considered as the top RH),
and the level of children is one more than that of their parent.
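The level convention above can be sketched in a few lines. This is only an illustration: the function name `levels`, the dictionary layout, and the node names are assumptions, not part of any TRACK specification.

```python
from collections import deque

def levels(parent_of, source):
    """Return {node: level} for a control tree given child -> parent bindings."""
    # Invert the bindings so each parent knows its children.
    children = {}
    for child, parent in parent_of.items():
        children.setdefault(parent, []).append(child)
    level = {source: 0}              # by convention the source is level 0
    queue = deque([source])
    while queue:
        node = queue.popleft()
        for child in children.get(node, []):
            level[child] = level[node] + 1   # one more than the parent's level
            queue.append(child)
    return level

# Hypothetical three-level tree: source S, two Repair Heads, three receivers.
bindings = {"RH1": "S", "RH2": "S", "r1": "RH1", "r2": "RH1", "r3": "RH2"}
tree_levels = levels(bindings, "S")
```

Here the Repair Heads end up at level 1 and the receivers at level 2, matching the three-level tree discussed in the text.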



2.1 Tree Structure 

We shall distinguish two communication channels: the data channel used to 
multicast data from the source to the members of the group and the RHs, and 
the control tree used to aggregate information and repair lost packets. The source 
sends original data on the data channel only once. In the control tree, a RH (or 
the sender) uses a multicast address to perform local retransmissions and to send 
maintenance packets to its children. In order to limit the retransmission scope
of a RH to the set of its children, each RH repairs lost packets using a different
multicast address. In Figure 1(a), we show a data channel and a control tree with
three levels.



2.2 Repair Head Functions 

Once the control tree is built, it can be used for retransmissions and receiver
feedback aggregation. The receiver feedback is necessary for detecting any
lost packets (for RH retransmissions) and for statistics collection: these statistics
(RTT, loss rates, number of children, etc.) are aggregated by each RH and
passed upwards. The source can use these indications for its congestion control
algorithm (which is outside the scope of this paper). In TRACK, receiver feedback consists
of regular ACKs, and optionally NACKs. In both cases, mechanisms are used
to control and limit the amount of feedback received by a RH. For ACKs, all the
receivers unicast their reception states at different times, ensuring a constant
(and controlled) load on the RH. In the case of NACKs, timer based schemes are
used to avoid unnecessary messages. When a given segment is detected missing, a
receiver triggers a random timer. The NACK is sent only if the timer expires and
if no retransmission for the same segment is received during this time interval.
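The timer-based NACK scheme can be illustrated with a minimal sketch. It rests on strong simplifying assumptions that are not in the text: timers are uniform on [0, max_backoff], propagation delays are ignored, and the repair reaches every child exactly repair_delay after the first NACK. All names are hypothetical.

```python
import random

def nacks_sent(num_missing, max_backoff, repair_delay, rng):
    """Count NACKs sent for one lost segment under random-timer suppression."""
    if num_missing == 0:
        return 0
    # Each receiver that misses the segment draws an independent random timer.
    timers = sorted(rng.uniform(0.0, max_backoff) for _ in range(num_missing))
    # The earliest timer always fires; its NACK triggers the repair.
    repair_time = timers[0] + repair_delay
    # Any other receiver sends a NACK only if its timer expires before the repair arrives.
    return 1 + sum(1 for t in timers[1:] if t < repair_time)

rng = random.Random(1)
few = nacks_sent(50, max_backoff=1.0, repair_delay=0.0, rng=rng)   # instant repair
many = nacks_sent(50, max_backoff=1.0, repair_delay=2.0, rng=rng)  # repair slower than any timer
```

With an instantaneous repair a single NACK is sent, while a repair slower than the largest possible timer suppresses nothing. In this simplified setting, the ratio between the timer range and the repair delay therefore controls the feedback load on the RH.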



2.3 Study of TRACK Protocols 

A variety of studies exist which analyse the behavior of a multicast session.
A recent work ([17]) analyses the throughput of a one-to-many communication
with regard to the topology of the routing tree and the number of receivers.
In [10], the placement of hierarchical reliable multicast servers based on tree
knowledge is optimized. We believe that in many cases, this won’t be possible
(dedicated servers have to be placed in advance). TRACK dedicated servers
will most probably be deployed without knowledge of receiver locations or,
therefore, of the data multicast tree (as in the mesh approach in [16]).




(a) Data channel and control tree. (b) The model: Clusters in Voronoi cells.

Fig. 1. Control Tree and the Geometric Representation.



3 The Analytical Model 

In this section, we present the mathematical framework used to model a TRACK 
session. Our approach is based on two models: 

— A geometrical model, described in Section 3.1, using point processes in the
plane which represent the location of participants of a multicast session for
a single source and the set of RHs,

— A loss model, described in Section 3.2, which represents the distribution of
losses in the network.



3.1 Geometrical Model 

Receivers: In order to model the hierarchical structure of the Internet, where
receivers are concentrated in local regions (e.g. site, AS, etc.), we have chosen the
cluster point process described below to represent the set of receivers participating
in the multicast session. In such a process, a first point process is scattered
in the plane. Each point of this process generates a cluster of points, and the
union of these clusters forms a new point process (points of the first process are
usually no longer considered). In the case we shall study, the underlying point process is a homogeneous
Poisson process (the choice of such a process is discussed in Remark 1). It
represents the set of locations of the concentration domains. The clusters are
i.i.d. (independent and identically distributed) point processes, and represent
the receivers within a concentration domain. The resulting processes are called
Neyman-Scott processes, and are known to be stationary ([23]). Formally, we
shall denote by $\pi_{cl}$ the Poisson point process which generates the set of clusters
and by $N_x$ the point process which represents the cluster located at $x$. With this
notation, the Neyman-Scott process $N$ can be written:

$$N = \sum_{x \in \pi_{cl}} N_x.$$

We do not define here the distribution of the clusters, as the formulae (see
the Appendix) are calculated for a generic distribution. In Figure 2(a), a sample
of a Neyman-Scott process is shown.
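A sample of such a process can be drawn with a short sketch. Since the text leaves the cluster distribution generic, the choice made here is purely an assumption for illustration: a Poisson number of points placed uniformly in a disc of radius cluster_radius around each cluster center. All function and parameter names are hypothetical.

```python
import math
import random

def poisson(mean, rng):
    """Poisson variate: number of unit-rate exponential arrivals falling in [0, mean)."""
    count, t = 0, rng.expovariate(1.0)
    while t < mean:
        count += 1
        t += rng.expovariate(1.0)
    return count

def uniform_in_disc(center, radius, rng):
    """Point drawn uniformly in the disc of the given center and radius."""
    r = radius * math.sqrt(rng.random())   # sqrt gives uniform density over the area
    theta = 2.0 * math.pi * rng.random()
    return (center[0] + r * math.cos(theta), center[1] + r * math.sin(theta))

def neyman_scott(intensity, window_radius, mean_cluster_size, cluster_radius, rng):
    """Return a list of (cluster_center, receiver_points) pairs inside a disc window."""
    area = math.pi * window_radius ** 2
    # Cluster centers: homogeneous Poisson process restricted to the window.
    centers = [uniform_in_disc((0.0, 0.0), window_radius, rng)
               for _ in range(poisson(intensity * area, rng))]
    # Each center generates an i.i.d. cluster (assumed: Poisson size, uniform in a small disc).
    return [(c, [uniform_in_disc(c, cluster_radius, rng)
                 for _ in range(poisson(mean_cluster_size, rng))])
            for c in centers]

clusters = neyman_scott(0.01, 50.0, 10.0, 2.0, random.Random(0))
```

Only the number of receivers per cluster matters for the loss model used later, so the spatial law inside a cluster is a convenience of this sketch.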

Repair Heads: The set of RHs is modeled by a homogeneous Poisson point
process $\pi_{RH}$. A RH is a parent of receivers of several clusters.

We shall further assume that $\pi_{RH}$ is independent of $\pi_{cl}$ and $N$. Actually, we
assume that the set of RHs is placed in the Internet in advance, for a large
set of different multicast sessions. The RH locations are thus independent of
the locations of receivers, which may join or leave groups at any time. Computing
average costs (w.r.t. receiver distribution and RH placement) will thus lead to
average TRACK performance for a wide range of possible group attendance
distributions.

For each RH, let $V_y(\pi_{RH})$ be the Voronoi cell centered at $y \in \pi_{RH}$, defined
as the set of points of $\mathbb{R}^2$ which are closer to $y$ than to any other point of $\pi_{RH}$.
The set of cells of $\pi_{RH}$ forms a tessellation of the plane (a Poisson process and
its tessellation are shown in Figure 2(b)). We assume that receivers connect to
their closest RH (i.e. closest point of $\pi_{RH}$). Thus, the children of $y \in \pi_{RH}$ are
the set of receivers within the clusters for which the generating point (a point of
$\pi_{cl}$) is inside the cell $V_y(\pi_{RH})$. Formally, all the points of a cluster located at $z$
are bound to a point $y$ of $\pi_{RH}$ if and only if $z \in V_y(\pi_{RH})$ (see Figure 1(b)).

We will look at the restrictions of the processes to a finite window in order
to keep a finite number of receivers, clusters and RHs. This window will be a
ball in $\mathbb{R}^2$ of radius $R_w$.
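Binding each cluster to the RH whose Voronoi cell contains its generating point amounts to a nearest-neighbor search, as the following sketch shows. The names `closest_rh` and `bind_clusters` and the coordinates are illustrative assumptions; a brute-force search stands in for a real Voronoi computation.

```python
import math

def closest_rh(cluster_center, rhs):
    """Index of the RH whose Voronoi cell contains cluster_center."""
    return min(range(len(rhs)), key=lambda i: math.dist(cluster_center, rhs[i]))

def bind_clusters(cluster_centers, rhs):
    """Map RH index -> cluster centers bound to it; only active RHs appear as keys."""
    children = {}
    for c in cluster_centers:
        children.setdefault(closest_rh(c, rhs), []).append(c)
    return children

# Hypothetical positions: three RHs, three cluster centers.
rhs = [(0.0, 0.0), (10.0, 0.0), (0.0, 10.0)]
centers = [(1.0, 1.0), (9.0, 1.0), (2.0, 0.5)]
binding = bind_clusters(centers, rhs)
```

In this example the third RH has no cluster in its cell and is therefore not active, which is why the number of active RHs can be smaller than the total number of RHs deployed.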

3.2 Loss Model 

The losses in a multicast tree are often correlated. Indeed, if a loss occurs in a 
node of the multicast tree, all participants which are downstream will not receive 
this packet. The number of participants which do not receive a packet depends 
on the location of the loss in the multicast tree. Yajnik et al. [22] have studied 
the distribution and the correlation of packet losses in the Mbone. They propose 
a loss model called a modified star topology. In such a model, a packet is lost 
for all receivers (i.e. near the source) with probability p, and may also be lost 
on the way to each receiver (independently of each other) with probability q. 
The probability of reaching a receiver is then $(1 - p)(1 - q)$. We have extended
this modified star topology in order to fit it to the network model presented in 
section 3.1. 
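The modified star topology can be simulated directly, which makes the source of loss correlation visible: the shared loss event affects every receiver at once. This is a minimal sketch with hypothetical names, using only the two probabilities p and q from the text.

```python
import random

def deliver(num_receivers, p, q, rng):
    """One transmission in a modified star topology; returns per-receiver success flags."""
    if rng.random() < p:
        # Shared loss near the source: no receiver gets the packet.
        return [False] * num_receivers
    # Otherwise each receiver loses the packet independently with probability q.
    return [rng.random() >= q for _ in range(num_receivers)]

def reach_probability(p, q):
    """Probability that a given receiver gets the packet: (1 - p)(1 - q)."""
    return (1.0 - p) * (1.0 - q)

rng = random.Random(42)
trials = 20000
hits = sum(deliver(5, 0.1, 0.2, rng)[0] for _ in range(trials))
estimate = hits / trials   # empirical reach probability of the first receiver
```

The empirical frequency should be close to `reach_probability(0.1, 0.2)`, while the outcomes of two different receivers remain positively correlated through the shared loss.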
Our loss model is the following:

— For the original transmission from the source

• The probability of reaching any cluster (centered at $y \in \pi_{cl}$) is $1 - p(\|y\|)$,
i.e. is a function of the distance between the source and the cluster. For
the sake of simplicity, we shall assume that $p(\|y\|) = p(\|x\|)$ for all
$y \in V_x(\pi_{RH})$, i.e. that the probability $p$ depends on the distance between the
source and the center of the cell to which the cluster belongs. However,
we shall still assume that losses between the different clusters in a given
cell are independent from each other.

• If the packet has reached the cluster, the probability of reaching a receiver
or a RH is $1 - q$ and is independent of other receivers.

— For a retransmission from the center of a cell (i.e. from a RH)

• The retransmission from a RH $x \in \pi_{RH}$ reaches a cluster (i.e. domain,
AS...) $y \in \pi_{cl}$ with probability $1 - p(\|y - x\|)$ (which depends on the distance
between the RH and the cluster). For the sake of simplicity, we shall
approximate each loss probability $p(\|y - x\|)$ with $p(m)$, where $m$ is the
average distance between a cluster and the center of the closest cell (i.e.
between the cluster and its RH). From [12], $m = 1/(2\sqrt{\lambda_{RH}})$, where
$\lambda_{RH}$ is the intensity of $\pi_{RH}$.

• If the packet has reached the domain, the probability of reaching a
receiver or a RH is $1 - q$ and is independent of other receivers.

In summary, for the original transmission from the source, a packet reaches
a receiver in cell $x$ with probability $(1 - p(\|x\|))(1 - q)$. For a retransmission
from a RH (centers of the cells), a packet reaches a receiver with probability
$(1 - p(m))(1 - q)$.
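The average distance $m$ between a cluster and its closest RH follows from a standard property of planar Poisson processes. The derivation below is a sketch; the symbol $\lambda_{RH}$ for the intensity of the RH process and $D$ for the distance to the closest RH are notational assumptions.

```latex
% D = distance from a fixed point to the closest point of a homogeneous
% planar Poisson process of intensity \lambda_{RH}.
% D > r means that the disc of radius r contains no point of the process:
P(D > r) = e^{-\lambda_{RH} \pi r^{2}}, \qquad r \ge 0.
% Integrating the tail gives the mean (Gaussian integral with a = \lambda_{RH}\pi):
m = E[D] = \int_{0}^{\infty} e^{-\lambda_{RH} \pi r^{2}} \, dr
         = \frac{1}{2} \sqrt{\frac{\pi}{\lambda_{RH} \pi}}
         = \frac{1}{2 \sqrt{\lambda_{RH}}}.
```

Deploying four times as many RHs therefore halves $m$, and with it the number of domains a retransmission has to cross.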

We could have added the probability that a packet is lost for all participants
of the multicast session. This case would correspond to a loss occurring
near the source. However, the cost function introduced in Section 4 (which will
be the number of useless transmissions) would not depend on this parameter.
Indeed, we shall consider a retransmission to be useless for a given receiver (resp. RH)
whenever the receiver (resp. RH) receives a packet which it has already
received.

Choice of $p(\|x\|)$: We can consider that the probability of reaching a point $x$
from another point $y$ is a function of the number of crossed domains. Moreover,
if we suppose that the number of crossed domains is proportional to the distance
between these two points (mathematical evidence of this property exists for
different models, e.g. [21]), then we can choose $1 - p(\|x\|)$ such that it represents
the probability of reaching $x$ from the source when each crossed domain is
successfully crossed with a given probability (that we shall denote $1 - a$). Thus,
if we denote by $N(x - y)$ the number of crossed domains between $x$ and $y$, we
have:



$$1 - p(\|x - y\|) = (1 - a)^{N(x - y)},$$

where $N(x - y) = \gamma \|x - y\|$ ($\gamma > 0$).

Table 1. Loss Probability.

a         mean loss
0.1       0.625
0.01      0.085
0.001     0.0088
0.0001    0.00088

In the numerical applications presented in Section 5, clusters are associated
with AS and $N(x)$ is chosen to represent the mean number of crossed domains
in the Internet (i.e. related to the so-called BGP AS path length, that we denote
by $\beta$). As discussed below, other representations are possible. The parameter $\gamma$
is then chosen such that the number of crossed domains for a typical point of
$\pi_{cl}$ is equal to $\beta$ ($\gamma E[\|y\|] = \beta$, $y \in \pi_{cl}$). In Table 1, we give the mean loss
probability to reach a typical cluster for a packet sent by the source w.r.t. $a$.
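The distance-dependent loss probability and the calibration of the proportionality parameter can be written down as a small sketch. The function names are hypothetical and the numeric values in the example are arbitrary illustrations, not the values behind Table 1 (which averages over the random position of a typical cluster).

```python
def loss_probability(distance, a, gamma):
    """p(d) defined by 1 - p(d) = (1 - a) ** (gamma * d)."""
    crossed_domains = gamma * distance      # N = gamma * distance
    return 1.0 - (1.0 - a) ** crossed_domains

def calibrate_gamma(beta, mean_distance):
    """Choose gamma so that gamma * E[distance] equals the mean AS path length beta."""
    return beta / mean_distance

# Arbitrary illustration: a mean path of 4 domains over a mean distance of 2 units.
gamma = calibrate_gamma(beta=4.0, mean_distance=2.0)
p_near = loss_probability(1.0, a=0.1, gamma=gamma)
p_far = loss_probability(2.0, a=0.1, gamma=gamma)
```

For small a the function is close to a * gamma * d, which is consistent with the roughly tenfold steps between the last rows of the table.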

Remark 1. In the mathematical model defined above, clusters are bound to the
closest RH w.r.t. the Euclidean distance. In fact, since in our model $p(d)$ is
increasing with the distance $d$, the clusters are connected to the RH which
minimizes the loss probability. Here the loss probability has been considered, but it
would be possible to choose other metrics (as proposed in [16]) such as delay,
number of hops, etc. Moreover, it should be noted that the cost function does
not depend directly on the location of points (which is just a mathematical
representation), but depends solely on the distribution of losses. The function $p(d)$
allows a mapping between actual average loss probabilities and a two dimensional
representation of receivers based on clusters in the plane.

When more sophisticated statistical inference of multicast losses becomes
available (see [18], for instance), an interesting research issue will be to try to find
more realistic mappings $p(d)$ and network representations. Nevertheless, due to
lack of statistics, the choice of the Poisson distribution for clusters and RHs is
motivated by its simplicity.



4 The Cost Function 

The cost of a reliable multicast is difficult to evaluate as different metrics can
be optimized (throughput, number of retransmissions, etc.). In this paper, we
do not use the classical number of retransmissions. Indeed, retransmissions of a
packet are useful as long as some receivers have not received the packet. Typically, when
a loss occurs, a large fraction of the receivers may not have received
the packet. We propose a new cost function based on the number of "useless"
transmissions. If a RH retransmits a packet, the number of useless transmissions
is the number of receivers which will receive this packet when they have already
received it (directly from the source or in a previous retransmission). Since a
(a) A cost function. (b) The evolution of the cost function.

Fig. 4. Cost Functions.



retransmission from a RH is multicast, there are always useless retransmissions
unless none of the children of this RH has received this packet before.

The cost function is defined as the number of useless retransmissions for each
level of hierarchy. Let $S_i$ be the random variable which represents the number
of useless retransmissions in level $i$; we have:

$$\mathrm{Cost} = \sum_{i=1}^{2} E[S_i],$$

where $E[S_i]$ is the expectation of $S_i$.

We note that we could weight each level of this function by constants in order
to favour retransmission at one particular level rather than another. For instance,
since the servers have a cost, it can be interesting to favour useless transmissions
between the RHs and the receivers.
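To make the notion of a useless retransmission concrete, here is a deliberately simplified sketch for a single RH and a single repair round. It is not the cost formula of the Appendix: it assumes independent losses (no cluster correlation), n children that each hold the packet with probability r, and a repair that reaches each child with probability s. All names are hypothetical.

```python
import random

def expected_useless(n, r, s):
    """Expected useless receptions for one repair round with independent losses."""
    # A child counts if it already holds the packet (r), receives the repair (s),
    # and at least one of the other n - 1 children missed the packet (1 - r**(n-1)).
    return n * r * s * (1.0 - r ** (n - 1))

def simulate_useless(n, r, s, trials, rng):
    """Monte Carlo estimate of the same quantity."""
    total = 0
    for _ in range(trials):
        has_packet = [rng.random() < r for _ in range(n)]
        if all(has_packet):
            continue   # nobody needs a repair, so the RH does not retransmit
        # The repair is multicast: children that already hold the packet may get it again.
        total += sum(1 for h in has_packet if h and rng.random() < s)
    return total / trials

exact = expected_useless(10, 0.9, 0.95)
estimate = simulate_useless(10, 0.9, 0.95, 20000, random.Random(7))
```

Even in this toy setting the trade-off discussed in the text appears: a RH with a single child never produces a useless reception, whereas a RH with many children almost always does.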

5 Results 

For the sake of clarity, the computation details of the cost function and its 
explicit formulae are given in the Appendix. These formulae are given for a 
generic distribution of the number of receivers within a cluster. We consider a 
particular case of this distribution in Subsection 5.1 for which we compute the 
optimal value of the intensity of $\pi_{RH}$ w.r.t. the loss parameters, the number of
domains and the mean number of receivers per cluster. 

5.1 Preliminaries 

We do not need the exact point locations within a cluster, since the loss
probability within a cluster is a constant q. We just have to choose the distribution
of the number of receivers within a cluster. In this paper, the random variable
which describes this number is Poisson distributed, chosen for its simplicity. In
future work, other distributions could be considered, based on multicast session
popularity for instance (e.g. [19,20]).

Fig. 5. Optimal Parameters when the Mean Number of Receivers Varies.

Fig. 6. Optimal Parameters when the Loss Probability q Varies.



5.2 Numerical Results 

We give the optimal number of RHs when the number of receivers per cluster, the
probability q and the parameter a (see Section 3.2) vary. In the following, when
parameters are not specified, the default values are: 0.001 for the loss probabilities
a and q, and 10 for the mean number of receivers in a cluster.
The optimum is found for the intensity of $\pi_{RH}$ which minimizes the cost
function. We deduce the optimal mean number of receivers per RH, the mean
number of active RHs^4 and the total number of RHs (active or not).

In Figures 4(a) and 4(b), we show the cost function with regard to the total
number of RHs. Figure 4(b) shows that the cost function can be very flat around
its minimum, and the minimum can even be reached for an infinite number of
RHs (since the function decreases). In this case, each cluster is connected to a
different RH; the number of active RHs is then equal to the number of clusters.

Impact of the Number of Receivers. For a fixed number of active clusters (resp.
100, 1000 and 10000), Figure 5(b) represents the optimal number of RHs as a
function of the number of receivers per cluster. Not surprisingly, the optimum
number of RHs increases as the number of receivers per cluster grows (the total
number of receivers grows proportionally). It can be noted that for 100 clusters,
the optimal number of RHs diverges and tends to infinity. Of course, it does not
make sense to choose an infinite number of RHs (only a finite number of RHs
will be selected by the receivers in any case); this phenomenon corresponds to
the fact that the optimum is reached when there is exactly one RH per domain.
For a larger number of clusters, however, it is interesting to note that the optimal
number of RHs is quasi-linear w.r.t. the mean number of receivers per cluster,
which will facilitate the determination of dimensioning rules.

In Figure 5(a), the optimal number of children connected to a RH is plotted
(under the same conditions as above). It can be seen that the optimal
number of receivers bound to a RH increases as the cluster size increases. The
optimum quickly reaches a plateau (which is related to the linearity of the curves in
Figure 5(b)).

Impact of the Intradomain Loss Probability (Parameter q). As expected, when
the loss probability within a cluster q increases, the required number of RHs also
increases (Figures 6(a) and 6(b)). The case where the mean number of clusters
is 100 has not been drawn because the optimal intensity is infinite; the number
of required servers is always one per cluster. We remark that even in the case
where the loss probability q is very high (up to 0.4), a RH is still responsible for
a large number of receivers.

Impact of the Interdomain Loss Probability (Parameter a). When the probability
a increases, the cost between the source and its children grows. Therefore, the
number of children bound to a RH increases (Figure 7(a)) and the total number
of RHs decreases (Figure 7(b)). The number of children per RH quickly reaches
a plateau; we note that these values are quite similar to those in Figure 5(a).

Impact of Loss Correlation. In this paragraph, we concentrate on the impact of 
receiver concentration (and thus of loss correlation). For a fixed mean number 
of receivers, we vary the mean number of clusters, and we change the number 

^4 RHs which have at least one child.
of receivers per cluster accordingly. It can be noted that the loss correlation
decreases as the number of clusters increases (the more clusters, the fewer receivers
per cluster). In Figure 9, the optimal cost function is plotted as a function of
the mean number of clusters. As expected, loss correlation has an important
impact on transmission efficiency. It can be observed that the cost decreases as
the loss correlation decreases, which can be explained by the choice of the cost
function: when losses are correlated (i.e. a small number of large clusters), a
retransmission will be useless for a larger number of receivers (almost all the
nodes of one or several clusters). In Figure 8(a), the mean number of receivers
bound to a RH is plotted as a function of the number of clusters. We can observe
that the weaker the loss correlation is (i.e. the larger the number of
clusters), the fewer RHs are required. In Figure 8(b), we can observe that the
optimum number of receivers per RH increases with the number of clusters (i.e.
as loss correlation decreases). In order to avoid unnecessary retransmissions, it
is better to have a small number of clusters under the responsibility of a given
RH when loss correlation is strong.

6 Conclusion 

In this paper, we have evaluated the impact of the tree structure on the
retransmission efficiency for TRACK architectures. Our study is based on an
analytical model using point processes in the plane and a loss model to represent
the loss distributions of a multicast transport session. More precisely, our
analytical model uses cluster processes which represent receiver concentration in
local regions (such as AS, domains, etc.), and allows us to capture loss
correlation among receivers. We have defined a cost function which represents
transmission efficiency, as it counts the number of useless retransmissions of a
data segment. Explicit formulae have been derived for the average cost function
w.r.t. generic receiver and RH distributions. These formulae allow us to deduce
the optimal TRACK tree configurations. More precisely, we were able to give
the optimal number of RHs which must be deployed in order to maximize
retransmission efficiency, with regard to specific topological parameters (such as
loss probabilities, receiver distributions, etc.).

We are working on better describing the optimal structures (w.r.t. several
cost functions) in order to get simple and generic dimensioning rules. We are
also trying to collect precise statistical information about loss distributions
and receiver concentrations for multicast sessions, in order to get a more realistic
random geometric representation of the Internet.

Abstract. Researchers have made much progress in designing secure
and scalable protocols to provide specific security services, such as data 
secrecy, data integrity, entity authentication and access control, to mul- 
ticast and group applications. However, less emphasis has been put on 
how to integrate security protocols with modern, highly efficient group 
communication systems and what issues arise in such secure group com- 
munication systems. In this paper, we present a flexible and modular 
architecture for integrating many different authentication and access con- 
trol policies and protocols with an existing group communication system, 
while allowing applications to provide their own protocols and control 
the policies. This architecture maintains, as much as possible, the scala- 
bility and performance characteristics of the unsecure system. We discuss 
some of the challenges when designing such a framework and show its 
implementation in the Spread wide-area group communication toolkit. 



1 Introduction 

The Internet is used today not only as a global information resource, but also to 
support collaborative applications such as voice- and video-conferencing, white- 
boards, distributed simulations, games and replicated servers of all types. Such 
collaborative applications often require secure message dissemination to a group 
and efficient synchronization mechanisms. Secure group communication systems 
provide these services and simplify application development. 

A secure group communication system needs to provide confidentiality and 
integrity of client data, integrity, and possibly confidentiality, of server control 
data, client authentication, message source authentication and access control of 
system resources and services. 

Many protocols, policy languages and algorithms have been developed to 
provide security services to groups. However, there has not been enough study 
of the integration of these techniques into group communication systems. Needed 
is a scheme flexible enough to accommodate a range of options and yet simple
and efficient enough to appeal to application developers. Complete secure group
communication systems are very rare and research on how to transition protocols
into complete systems has been scarce.

* This work was supported by grant F30602-00-2-0526 from The Defense Advanced
Research Projects Agency.

Secure group systems really involve the intersection of three major, and dis- 
tinct, research areas: networking protocols, distributed algorithms and systems, 
and cryptographic security protocols. 

A simplistic approach when building a secure group system is to select a spe- 
cific key management protocol, a standard encryption algorithm, and an existing 
access control policy language and integrate them with a messaging system. This 
would produce a working system, but would be complex, fixed in abilities, and 
hard to maintain as security features would be mixed with networking protocols 
and distributed algorithms. 

In contrast, a more sophisticated approach is to construct an architecture that
allows applications to plug in both their desired security policy and the
mechanisms to enforce the policy. Since each application has its particular security
policies, it is natural to give an application more control not only over specifying
the policy, but also over the implementation of the services that are part of the policy.

This paper proposes a new approach to group communication system archi- 
tecture. More precisely, it provides such an architecture for authentication and 
access control. The architecture is flexible, allowing many different protocols to 
be supported and even be executing at the same time; it is modular so that secu- 
rity protocols can be implemented and maintained independently of the network 
and distributed protocols that make up the group messaging system; it allows 
applications to control what security services and protocols they use and
configure; it efficiently enforces the chosen security policy without unduly impacting
the messaging performance of the system.

As many group communication systems are built around a client-server archi- 
tecture where a relatively small number of servers provide group communication 
services to numerous clients, we focused on systems utilizing this architecture.

We implemented the framework in the Spread wide-area group communica- 
tion system. We evaluate the flexibility and simplicity of the framework through 
six case studies of different authentication and access control methods. We show 
how both simple (IP based access control, password based authentication) and 
sophisticated (SecurID, PAM, anonymous payment, and group based) protocols
can be supported by our framework. 

Note that this paper is not a defense of any particular access control policy, 
authentication method or group trust model. Instead, it provides a flexible,
complete interface to allow many such policies, methods, or models to be expressed
and enforced by an existing, actively used group communication system.

The rest of the paper is organized as follows. Section 2 overviews related
work. We present the authentication and access control framework and its
implementation in the Spread toolkit in Section 3. We provide several brief case
studies of how diverse protocols and policies can be supported by the framework
in Section 4. Finally, we conclude and discuss future directions.

Some of the work may apply to network level multicast, but we have not explored
that.

2 Related Work 

There are two major directions in secure group communication research. The first 
one aims to provide security services for IP-Multicast and reliable IP-Multicast. 
Research in this area assumes a model consisting of one sender and many re- 
ceivers and focuses on the high scalability of the protocols. Since the presence 
of a shared secret can be used as a foundation of efficiently providing data con- 
fidentiality and data integrity, a lot of work has been done in designing very 
scalable key management protocols. For lack of space we cite only the very
recent ones: the VersaKey Framework and the Group Secure Association Key
Management Protocol (GSAKMP) [12].

The second major direction in secure group communication research is securing
application level multicast systems, also known as group communication systems.
These systems assume a many-to-many communication model where each
member of the group can be both a receiver and a sender, and provide reliability,
strong message ordering and group membership guarantees, with moderate
scalability. Initially group communication systems were designed as high-availability,
fault-tolerant systems, for use in local area networks. Therefore, the first group
communication systems (ISIS, Horus, Transis, Totem, and RMP)
were less concerned with addressing security issues, and focused more on the
ordering and synchronization semantics provided to the application (the Virtual
Synchrony and Extended Virtual Synchrony models).

The number of secure group communication systems is small. Besides our
system (Spread), the only implementations of group communication systems that
focus on security are the RAMPART system at AT&T [20], the SecureRing
project at UCSB and the Horus/Ensemble work at Cornell [22]. A special case
is the Antigone framework, designed to provide mechanisms allowing flexible
application security policies. Most relevant to this work are the Ensemble
and the Antigone systems. Ensemble focused on optimizing group key
distribution, and chose to allow application-dependent trust models in the form of
access control lists treated as replicated data within the group. Authentication
is achieved by using PGP. Antigone, in contrast, allows flexible application security
policies (rekeying policy, membership awareness policy, process failure policy
and access control policy). However, it uses a fixed protocol to authenticate a
new member and negotiate a key, while access control is performed based on a
pre-configured access control list.

We also consider frameworks designed with the purpose of providing
authentication and/or access control, without addressing group communication issues.
They are therefore complementary to our work. One of these frameworks is the
Pluggable Authentication Module (PAM), which provides authentication
services to UNIX system services (like login, ftp, etc). PAM allows an application
not only to choose how to authenticate users, but also to switch dynamically
between the authentication mechanisms without (rewriting and) recompiling a
PAM-aware application. Other frameworks providing access control and
authentication services are systems such as Kerberos and Akenti [23]. Both of
them have in common the idea of authenticating users and allowing access to
resources, with the difference being that Kerberos uses symmetric cryptography,
while Akenti uses public-key cryptography to achieve their goals.

One flexible module system that supports various security protocols is
Flexinet. Flexinet is an object oriented framework that focuses on dynamic
negotiations, but does not provide any group-oriented semantics or services.



3 General System Architecture 

The overall goal of this work is to provide a framework that integrates many 
different security protocols and supports all types of applications which have 
changing authentication and access control policy requirements, while maintain- 
ing a clear separation of the security policy from the group messaging system 
implementation. In this section, after discussing some design considerations, we 
present the authentication and access control frameworks. 

3.1 Why Is a General Framework Needed? 

When a communication system may only be used with one particular application, 
integrating the specific security policy and needed protocols with the system may 
make sense. However, when a communication system needs to support many 
different applications that may not always be cooperative, separating the policy 
issues which will be unique to each application from the enforcement mechanisms 
which must work for all applications avoids an unworkable “one-size-fits-all” 
security model, while maintaining efficiency. 

Separating the policy implementation from both the application and the 
group communication system is also useful because in a live, production envi- 
ronment, the policy restrictions and access rules will change much more often 
than the code or system changes. So modifications of policy modules should not 
require recompiling or changing the application code. 

The features of the general framework, as opposed to the features of a par- 
ticular authentication or access control protocol, are: 

1. Individual policies for each application. 

2. Efficient policy enforcement in the messaging system. 

3. Simple interface for both authentication and access control modules. 

4. Independence of the messaging system from security protocols. 

5. Many policies and protocols work with the framework, including: access con- 
trol lists, password authentication, public/private key, certificates, role based 
access control, anonymous users, and dynamic peer-group policies. 

We distinguish between authentication and access control modules to provide
more flexibility. Each type of module has a distinctive interface which supports
its specific task. The authentication module verifies that a client is who it claims
to be. The access control module decides about all of the group communication
specific actions a client attempts after it has been authenticated: join or leave
a group, send a unicast message to another client or multicast a message to a
group. It also decides whether a client is allowed to connect to a server (the access
control module can deny a connection even if the authentication succeeded).

The framework supports dynamic policies. The main challenge with such 
policies is to allow changes during execution. Since the framework itself does 
not have any knowledge of the actual policy (for example, it does not cache
decisions or restrict what form actual policies take), it is possible for the access
control modules to change how they make decisions independently of the server. The
modules need to make sure they activate dynamic changes in a consistent way, 
by using synchronized clocks, or by using the group communication services to 
agree on when to activate changes. 
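
One way a module might activate a dynamic change consistently is to attach an agreed activation time to each new rule set and switch only once that time is reached. The following is a minimal sketch under that assumption; the types policy_version and dynamic_policy and both functions are hypothetical and are not part of Spread or of the framework API, and synchronized clocks are assumed.

```c
#include <stdbool.h>
#include <time.h>

/* Hypothetical policy record: not part of Spread or the framework API. */
struct policy_version {
    time_t activate_at;   /* agreed activation time (synchronized clocks assumed) */
    bool   allow_join;    /* toy rule: may clients join groups at all? */
};

struct dynamic_policy {
    struct policy_version current;
    struct policy_version pending;
    bool                  has_pending;
};

/* Schedule a new rule set; it takes effect only once activate_at is reached. */
void policy_schedule(struct dynamic_policy *p, struct policy_version next)
{
    p->pending = next;
    p->has_pending = true;
}

/* Decide a join request at time now, switching versions first if one is due. */
bool policy_allow_join(struct dynamic_policy *p, time_t now)
{
    if (p->has_pending && now >= p->pending.activate_at) {
        p->current = p->pending;
        p->has_pending = false;
    }
    return p->current.allow_join;
}
```

Because every module instance applies the same comparison against the same activation time, all of them change behaviour at the same logical instant, provided their clocks agree closely enough for the application.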

3.2 Framework Implementation in Spread 

We implemented the framework in the Spread group communication system to 
give a concrete, real-world basis for evaluating the usefulness of this general 
architecture. Although we only implemented the framework within the Spread 
system, the model and the interface of the framework are actually quite general 
and the set of events upon which access control decisions can be made includes 
all of the available actions in a group-based messaging service (join, leave, group 
send, unicast send, connect). 

3.3 The Spread Group Communication Toolkit 

Spread is a local and wide-area messaging infrastructure supporting reliable
multicast and group communication. It provides reliability and ordering of
messages (FIFO, causal, total ordering) and a membership service. The toolkit
supports four different semantics: No membership, Closely Synchronous,
Extended Virtual Synchrony (EVS) and View Synchrony (VS).

The system consists of one or more servers and a library linked with the appli- 
cation. The servers maintain most of the state of the system and provide reliable 
multicast dissemination, ordering of messages and the membership services. The 
library provides an API and basic services for message oriented applications. The 
application and the library can run on the same machine as a Spread server, in 
which case they communicate over IPC, or on separate machines, in which case 
the client-server protocol runs over TCP/IP. 

Note that in order to implement our framework, we needed to modify both the 
Spread client library and the Spread daemon. When an application implements 
its own authentication and access control method, it needs to implement both
the client side and the server side modules; however, it does not need to modify
the Spread library or the Spread daemon.

Footnote: Closely Synchronous is a relaxed version of EVS for reliable and FIFO messages.

In Spread each member of the group can be both a sender and a receiver. The
system is designed to support small to medium size groups, but can accommo- 
date a large number of different collaboration sessions, each of which spans the 
Internet. This is achieved by using unicast messages over the wide-area network 
and routing them between Spread nodes on an overlay network. Spread scales 
well with the number of groups used by the application without imposing any 
overhead on the network routers. Group naming and addressing is not a shared 
resource (as in IP multicast addressing), but rather a large space of strings which 
is unique to a collaboration session. 

The Spread toolkit is available publicly and is being used by several orga- 
nizations for both research and practical projects. The toolkit supports cross- 
platform applications and has been ported to several Unix platforms as well as 
Windows and Java environments. 

3.4 Authentication Framework 

All clients are authenticated when connecting to a server, and trusted afterwards. 
Therefore, when a client attempts actions, such as sending messages or joining 
groups, no authentication is needed. However, the attempted user actions are 
checked against a specified policy which controls which actions are permitted 
or denied for that user. This approach explicitly assumes that as long as a 
connection to the server is maintained, the same user is authenticated. 



[Figure: Client and Server; legend: TCP communication, function call]

Fig. 1. Authentication Architecture and Communication Flow.



Figure 1 presents the architecture and the process of authentication. Both
the client and the server implement an authentication module.

The change on the client side consists of the addition of a function (see
Figure 2) that allows an application to set the authentication protocol it wishes
to use and to pass in any necessary data to that protocol, before connecting
to a Spread server. When the function through which a client requests a
connection to a server is called (SP_connect), the connection is established using
the method set by the application. The authentication method chosen
by the application applies to all connections established by this application.



int SP_set_auth_method( const char *auth_name,
                        int (*authenticate)(int, void *), void *auth_data );
int SP_set_auth_methods( int num_methods, const char *auth_name[],
                         int (*authenticate[])(int, void *), void *auth_data[] );

/* declaration of authenticate function */
int authenticate( int fd, void *user_data_pointer );



Fig. 2. Client Authentication Module API. 



A server authentication module needs to implement the functions listed in 
the auth_ops structure (see Figure 3, line 10). Then the module should register
itself with the Spread daemon by calling the Acm_auth_add_method function. By 
default, a module is registered in the ’disabled’ state. The system administrator 
can enable the module when configuring Spread. 

The authentication process begins when the session layer of the daemon re- 
ceives a connection request from a client. After some initial information exchange 
and negotiation of the allowed authentication protocols, the session module con- 
structs a session_auth_info structure containing the list of agreed upon authen- 
tication protocols. This structure is passed as a parameter to each authentication 
function and is used as a handle for the entire process of authenticating a client. 
The authentication function can use the module_data pointer to store any mod- 
ule specific data that it needs during authentication. The session layer calls the 
auth_client_connection method for each protocol and then “forgets about” the
client connection. A minimal state about the client is stored, but no messages 
are received or delivered to the client at this point. 

The auth_client_connection function is responsible for authenticating the 
client connection. If authenticating the client will take a substantial amount of 
CPU or real time, the function should not do the work directly, but rather set up
a callback function to be called later (for example when messages arrive from 
the client), and then it should return. Another approach is to fork off another 
process to handle the authentication. This is required because the daemon is 
blocked while this function is running. 

struct session_auth_info {
    int   ses;
    void *module_data;
    int   num_required_auths;
    int   completed_required_auths;
    int   required_auth_methods[MAX_AUTH_METHODS];
    int   required_auth_results[MAX_AUTH_METHODS];
};

struct auth_ops {
    void (*auth_client_connection)(struct session_auth_info *sess_auth_p);
};

struct acp_ops {
    bool (*open_connection)(char *user);
    bool (*open_monitor)(char *user);   /* not used currently */
    bool (*join_group)(char *user, char *group, void *acm_token);
    bool (*leave_group)(char *user, char *group, void *acm_token);
    bool (*p2p_send)(char *user, char dests[][MAX_GROUP_NAME], int service_type);
    bool (*mcast_send)(char *user, char groups[][MAX_GROUP_NAME], int service_type);
};

/* Auth Functions */
bool Acm_auth_add_method(char *name, struct auth_ops *ops);

/* Access Control Policy Functions */
bool Acm_acp_set_policy(char *policy_name);
bool Acm_acp_add_method(char *name, struct acp_ops *ops);

Fig. 3. Server Authentication and Access Control Module API.

The auth_client_connection function never returns a decision value because
a decision may not have been reached yet. When a decision has been made the
server authentication module calls Sess_session_report_auth_result and releases
control to the session layer. The Sess_session_report_auth_result function
reports whether the current authentication module has successfully authenticated
the session or not. If more than one authentication method was required, the
connection succeeds if all the methods succeed.
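
The "all required methods must succeed" rule can be illustrated with a small bookkeeping routine. This is only a sketch: the structure reuses the field names of Figure 3, but how the daemon actually uses those fields is an assumption, as are the value of MAX_AUTH_METHODS, the auth_outcome enumeration and the record_auth_result function.

```c
#include <stdbool.h>

#define MAX_AUTH_METHODS 3   /* assumed value; the paper does not give it */

/* Field names follow Figure 3; their exact use in the daemon is assumed. */
struct session_auth_info {
    int   ses;
    void *module_data;
    int   num_required_auths;
    int   completed_required_auths;
    int   required_auth_methods[MAX_AUTH_METHODS];
    int   required_auth_results[MAX_AUTH_METHODS];
};

enum auth_outcome { AUTH_PENDING, AUTH_ACCEPTED, AUTH_REJECTED };

/* Record the result reported by one method and return the overall state.
 * A single failure rejects the connection; acceptance requires that every
 * required method has reported success. */
enum auth_outcome record_auth_result(struct session_auth_info *info,
                                     int method_id, bool success)
{
    for (int i = 0; i < info->num_required_auths; i++) {
        if (info->required_auth_methods[i] == method_id) {
            info->required_auth_results[i] = success ? 1 : 0;
            info->completed_required_auths++;
            break;
        }
    }
    if (!success)
        return AUTH_REJECTED;
    if (info->completed_required_auths < info->num_required_auths)
        return AUTH_PENDING;
    return AUTH_ACCEPTED;
}
```

In this sketch the session stays pending until the last required method reports, which matches the description that auth_client_connection itself never returns a decision.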



3.5 Access Control Framework 

In our model, an authenticated client connection is not automatically allowed 
to perform any actions. Each action a client may request of the server, such 
as sending a message or joining or leaving a group, is checked at the time it 
is attempted against an access control policy module. The enforcement checks
are implemented by having the session layer of the server call the appropriate
access control policy module callback function (see Figure 3, lines 14-20), which
returns a decision. The implementation of the check functions should be optimized, as
they have a direct impact on the performance of the system: they are called
for every client action.

If the module chooses to allow the request, then the server handles it nor- 
mally. In the case of rejection, the server creates a special “reject” message which 
will be sent to the client in the normal stream of messages. The reject message 
contains as much of the data included in the original attempt as possible. The 
application should be able to identify which message was rejected by whatever 
information it stored in the body of the message (such as an application level 
sequence number) and respond to it appropriately. That response could be a no- 
tification to the user, establishing a new connection with different authentication 
credentials and retrying the request, logging an error, etc. 
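
As an illustration of how an application might recognize which request was rejected, the sketch below keeps a table of outstanding application-level sequence numbers and matches the body echoed in a reject message against it. The header layout, the table and both functions are hypothetical; the paper only states that the application can use whatever information it stored in the message body.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define MAX_PENDING 16   /* arbitrary bound for this sketch */

/* Hypothetical application-level header at the start of every message body. */
struct app_header {
    uint32_t seq;
};

/* Sequence numbers of requests sent but not yet known to be accepted. */
struct pending_table {
    uint32_t seq[MAX_PENDING];
    bool     used[MAX_PENDING];
};

/* Remember an outgoing request; returns false if the table is full. */
bool pending_add(struct pending_table *t, uint32_t seq)
{
    for (int i = 0; i < MAX_PENDING; i++) {
        if (!t->used[i]) {
            t->used[i] = true;
            t->seq[i] = seq;
            return true;
        }
    }
    return false;
}

/* Match the body echoed in a reject message against pending requests.
 * Returns true and clears the entry if the rejected request is found. */
bool pending_handle_reject(struct pending_table *t, const void *body, size_t len)
{
    struct app_header h;

    if (len < sizeof h)
        return false;            /* body too short to carry our header */
    memcpy(&h, body, sizeof h);  /* copy out to avoid unaligned access */
    for (int i = 0; i < MAX_PENDING; i++) {
        if (t->used[i] && t->seq[i] == h.seq) {
            t->used[i] = false;
            return true;
        }
    }
    return false;
}
```

On a match the application can then notify the user, retry with other credentials, or log the error, as described above.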

The server can reject an action at two points: when the server receives the
action from the client, or when the action is going to take effect. For example,
when a client joins a group the join can be rejected when the join request is
received from the directly connected client, or when the join request has been
sent to all of the servers and has been totally ordered. Rejecting the request
the first time it is seen avoids processing requests that will later be rejected
and simplifies the decision-making because only the server the client is directly
connected to will make the decision. The disadvantage is that at the time the
request is being accepted or rejected the module only knows the current state of
the group or system and not what the state will be when the request would be
acted upon by the servers. Since these states can differ, some types of decisions
may not be possible at the early decision point.

4 Case Studies 

To provide some intuition as to what building a Spread authentication module 
requires, this section discusses the implementation of several real-world modules: 
an IP based access control module, a password based authentication module, a 
SecurID or PAM authentication module, an anonymous payment authentication
and anonymous access control module, and a dynamic peer-group authentication 
module. For more details and implementation code see [0|. 

IP Access Control. A very simple access control method that does not involve
any interaction with the client process or library is one that is based on the
IP address of the clients. The connection is allowed or denied based on the IP address
from which the client connected to the server. This module only restricts the
open_connection (see Figure 3, line 15) operation.
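
The core of such a module is an address test. A minimal sketch, assuming IPv4 and a single allowed subnet, is given below. The ip_rule structure and the ip_allowed function are hypothetical; the open_connection hook of Figure 3 receives only a user name, and how the module obtains the client address is not shown in the paper, so the helper simply takes the address as a string.

```c
#include <arpa/inet.h>
#include <netinet/in.h>
#include <stdbool.h>
#include <stdint.h>
#include <sys/socket.h>

/* Hypothetical rule: allow clients whose IPv4 address lies in network/prefix. */
struct ip_rule {
    const char *network;     /* dotted quad, e.g. "192.168.1.0" */
    int         prefix_len;  /* 0..32 */
};

/* Returns true if client_ip matches the rule; malformed input is denied. */
bool ip_allowed(const struct ip_rule *rule, const char *client_ip)
{
    struct in_addr net, cli;

    if (rule->prefix_len < 0 || rule->prefix_len > 32)
        return false;
    if (inet_pton(AF_INET, rule->network, &net) != 1 ||
        inet_pton(AF_INET, client_ip, &cli) != 1)
        return false;

    /* Build the mask in host byte order; prefix 0 must not shift by 32. */
    uint32_t mask = rule->prefix_len == 0
                        ? 0
                        : 0xFFFFFFFFu << (32 - rule->prefix_len);
    return (ntohl(net.s_addr) & mask) == (ntohl(cli.s_addr) & mask);
}
```

Denying on malformed input keeps the check fail-closed, which is the safer default for an access control decision.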

Password Authentication. A common form of authentication uses some type of 
password and username to establish the identity of the user. Many types of pass- 
word based authentication can be supported by our framework from passwords 
sent in the clear (like in telnet) to challenge-response passwords. 

To implement a password-based authentication method, both a client and
a server side need to be implemented. The server module can use the Events
subsystem in Spread to wait for network events to occur and avoid blocking the
server while the user is entering a password or the client and server modules are
communicating. The client module consists of one function which is called during
the establishment of a connection and returns either success or failure. The
function can use the file descriptor of the socket over which the connection is being
established and whatever data pointer was registered through SP_set_auth_method.
In this case the application prompts the user for a username and password and
creates a user_password structure. The authenticate function sends the
username and the password to the server and waits for a response informing it
whether or not the authentication succeeded.

SecurID. A popular authentication method is RSA SecurID. The method uses
a SecurID server to authenticate a SecurID client based on a unique randomly
generated identifier and a PIN. In some cases the SecurID server might ask the
client to provide new credentials. We do not discuss here the internals of the
SecurID authentication mechanism, but focus on how
our framework can accommodate this method.

The main difference from the previous examples is that here the Server
Authentication Module needs to communicate with the SecurID server. As
mentioned before, the auth_client_connection function should not block. Blocking
can happen when opening a connection with a SecurID server and retrieving
messages from it. Therefore, auth_client_connection forks another process
responsible for the authentication protocol and then registers an event such that
it will get notified when the forked process finishes. The forked process
establishes a connection with the SecurID server and authenticates the user. When
it finishes, the Server Authentication Module gets notified, so it can call the
Sess_session_report_auth_result function to inform the Spread daemon that a
decision has been made and to pass control back to it.
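
The fork-and-notify pattern described above can be sketched with a pipe whose read end becomes readable when the child has a verdict. All names below (auth_worker, auth_worker_start, auth_worker_finish, demo_check) are hypothetical; demo_check merely stands in for the blocking exchange with an external authentication server, and the registration of the descriptor with the Spread Events subsystem is not shown.

```c
#include <stdbool.h>
#include <string.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Hypothetical blocking check standing in for the exchange with an
 * external authentication server; the real protocol is not shown. */
typedef bool (*blocking_check_fn)(const char *user);

/* Toy check used only to exercise the sketch. */
bool demo_check(const char *user)
{
    return user != NULL && strcmp(user, "alice") == 0;
}

struct auth_worker {
    pid_t pid;        /* forked child */
    int   result_fd;  /* read end of the pipe; readable when the child is done */
};

/* Fork a child to run the blocking check. Returns 0 on success.
 * The caller returns immediately and watches w->result_fd for readability. */
int auth_worker_start(struct auth_worker *w, blocking_check_fn check,
                      const char *user)
{
    int fds[2];

    if (pipe(fds) != 0)
        return -1;

    pid_t pid = fork();
    if (pid < 0) {
        close(fds[0]);
        close(fds[1]);
        return -1;
    }
    if (pid == 0) {                  /* child: may block freely */
        close(fds[0]);
        char verdict = check(user) ? 1 : 0;
        ssize_t n = write(fds[1], &verdict, 1);
        _exit(n == 1 ? 0 : 1);
    }
    close(fds[1]);                   /* parent keeps only the read end */
    w->pid = pid;
    w->result_fd = fds[0];
    return 0;
}

/* To be called once result_fd is readable: collect the verdict and reap
 * the child. This is where a module would report the result to the daemon. */
bool auth_worker_finish(struct auth_worker *w)
{
    char verdict = 0;
    ssize_t n = read(w->result_fd, &verdict, 1);

    close(w->result_fd);
    waitpid(w->pid, NULL, 0);
    return n == 1 && verdict == 1;
}
```

The parent never blocks in auth_worker_start, so the daemon can keep serving other clients while the child talks to the external server.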

PAM. Another popular method of authentication is the modular PAM system,
which is standard on Solaris and many Linux systems. Here the
cation module will act as a client to a PAM system and request authentication 
through the standard PAM function calls. To make authentication through PAM 
work, the module must provide a way for PAM to communicate and interact with 
the actual human user of the system, to prompt for a password or other informa- 
tion. The module would register an interactivity function with PAM that would 
pass all of the requests to write to the user or request input from the user over 
the Spread communication socket to the Spread client authentication module 
for PAM. This client module would then act on the PAM requests and interact 
with the user and then send the reply back to the Spread authentication module 
which would return the results to the actual PAM function. 

Anonymous Payments. An interesting approach is when access is provided to
anonymous clients in exchange for payment. These systems perform transactions
between a client and a merchant, assuming that both of them have accounts
with a Bank. By using cryptographic techniques, the system provides anonymity
of the client and basic security services. We do not describe the cryptographic
details, but show how this method can be accommodated in our framework.

We assume support from the anonymous payments system (in the form of
an API) and require the servers and the client to have an account with a Bank.
When a client connects to a server, the Client Authentication Module generates
a check and an identifier of the client’s account and then passes them to the Server
Authentication Module which will then contact the Bank to validate the check (if
necessary another process will be forked as in the SecurID case). When validated,
the Server Authentication Module will register the client’s identifier with the
access control policy as a paid user of the appropriate groups. Then, for as long
as the payment is valid, the client will be permitted to access the groups they
paid for and the server has no knowledge of the client’s identity.

Group-Based Authentication. In all the previous authentication methods pre- 
sented, the authentication of a client is handled by the server that the client 
connects to. In larger, non-homogeneous environments authentication may in- 
volve some or all of the group communication system servers. Although these 
protocols may be more complex, they can provide better mappings of adminis- 
trative domains, and possibly better scalability. 

An example of such a protocol is when a server does not have sufficient 
knowledge to check a client’s credentials (for instance a certificate). In this case, 
it sends the credentials to all the servers in the configuration and each server 
then attempts to check the credentials itself and sends an answer back. If at 
least one server succeeds, the client is authenticated. The particularity of such
a protocol is that the servers need to communicate with one another as part of
the authentication process. Since all the servers can communicate with one another
in our system, the framework provides all the features necessary to allow the
integration of such a group-based authentication method.
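
The decision rule of this example protocol can be sketched as follows. This is a minimal illustration only, not code from the Spread framework: the `Verifier` type and the `group_authenticate` function are hypothetical names, and a verifier is assumed to return `None` when a server lacks the knowledge to check the credentials.

```python
from typing import Callable, Iterable, Optional

# Hypothetical type: a verifier returns True (valid), False (invalid),
# or None when the server lacks the knowledge to decide.
Verifier = Callable[[str], Optional[bool]]


def group_authenticate(credentials: str, local: Verifier,
                       peers: Iterable[Verifier]) -> bool:
    """Accept if the local server can decide, else if any peer validates."""
    verdict = local(credentials)
    if verdict is not None:
        return verdict
    # The local server cannot decide: ask every server in the configuration.
    answers = [peer(credentials) for peer in peers]
    # The client is authenticated if at least one server succeeds.
    return any(answer is True for answer in answers)
```

In a real deployment the peer calls would be messages exchanged between servers; plain function calls stand in for them here.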

Access Control. We realize that the above case studies are focused on authenti- 
cation. Few standard access control protocols that we could use as case studies 
exist. To demonstrate the ability of the access control architecture we create 
a case study about an imaginary secure IRC system. Consider a set of users 
where some users are allowed to chat on the intelligence group, while others are 
restricted to the operations group. Some are allowed to multicast to a group but 
are not allowed to read the group messages (virtual drop-box). Our framework 
supports these access control policies through appropriate implementation of the 
join and multicast hooks defined in Figure 0. Access control modules support 
identity based, role based, or credential based restrictions. 
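
As an illustration of the imaginary secure IRC case, the sketch below models an identity-based policy with one check per hook. The class name `IrcPolicy`, the method names `on_join` and `on_multicast`, and the policy tables are assumptions made for this example; they are not the framework's actual hook signatures.

```python
from dataclasses import dataclass, field
from typing import Dict, Set


@dataclass
class IrcPolicy:
    # Hypothetical policy tables: user name -> set of group names.
    may_join: Dict[str, Set[str]] = field(default_factory=dict)
    may_multicast: Dict[str, Set[str]] = field(default_factory=dict)

    def on_join(self, user: str, group: str) -> bool:
        """Join hook: joining a group is what allows reading its messages."""
        return group in self.may_join.get(user, set())

    def on_multicast(self, user: str, group: str) -> bool:
        """Multicast hook: may this user send to the group?"""
        return group in self.may_multicast.get(user, set())


# A virtual drop-box: "carol" may send to "intelligence" but not read it.
policy = IrcPolicy(
    may_join={"alice": {"intelligence"}, "bob": {"operations"}},
    may_multicast={"alice": {"intelligence"}, "carol": {"intelligence"}},
)
```

Role-based or credential-based restrictions would replace the user name lookup with a lookup on the role or credential presented at authentication time.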

5 Conclusions and Future Work 

We presented a flexible implementation of an authentication and access control 
framework in the Spread wide area group communication system. Our approach 
allows an application to write its own authentication and access control modules
without needing to modify the Spread client or server code. The flexibility of
the system was demonstrated by showing how a wide range of authentication methods
can be implemented in our framework.

Many open problems remain as subjects of future work. These
include: providing tools that allow an application to actually specify a policy, 
handling policies in a system supporting network partitions (for example merging 
components with different policies), providing support for meta-policies defin- 
ing which entity is allowed to create or modify group policies, and developing 
dynamic group trust protocols for authentication.

Abstract. Bi-directional shared tree is an efficient routing scheme for
interactive multicast applications with multiple sources. Given the open-group 
IP multicast service model, it is important to perform sender access control so 
as to prevent group members from receiving irrelevant data, and also protect the 
multicast tree from various Denial-of-Service (DoS) attacks. Compared with 
source specific trees and uni-directional shared trees where information sources 
can be authorized or authenticated at the single root or Rendezvous Point (RP), 
in bi-directional trees this problem becomes challenging since hosts can send
data to the shared tree from any network point. In this paper we propose a 
scalable sender access policy mechanism for bi-directional shared trees so that 
irrelevant data is policed and discarded once it hits any on-tree router. We 
consider the scenario of both intra-domain and inter-domain routing in the 
deployment of the policy, so that the mechanism can adapt to situations in 
which large-scale multicast applications or many concurrent multicast sessions 
are involved, potentially across administrative domains. 



1 Introduction 

IP multicast [9] supports efficient communication services for applications in which 
an information source sends data to a group of receivers simultaneously. Although 
some IP multicast applications have been available on the experimental Multicast 
Backbone (MBone) for several years, large-scale deployment has not been achieved
until now. IP multicast is also known as “Any Source Multicast (ASM)” in that an
information source can send data to any group without any control mechanism. In the
current service model, group management is not stringent enough to control both
senders and receivers. IGMPv1 [11] is used to manage group members when they join
or leave the session but in this protocol there are no control mechanisms to avoid 
receiving data from particular information sources or prevent particular receivers 
from receiving sensitive information. It has been observed that the above 
characteristics of IP multicast have somehow prevented successful deployment of 
related applications at large scale on the Internet [10]. 

Realizing that many multicast applications are based on one-to-many 
communications, e.g. Internet TV/radio, pushed media, etc., H. W. Holbrook et al 
proposed the EXPRESS routing scheme [14], from which the Source Specific 
Multicast (SSM) [15] service model subsequently evolved. In SSM each group is
identified by an address tuple (S, G) where S is the unique address of the information
source and G is the destination channel address. A single multicast tree is built rooted
at the well-known source for delivering data to all subscribers. Under such a scenario,
centralized group authorization and authentication can be achieved at the root of the
single source at the application level. Currently IGMPv3 [7] is under development
to support source specific joins in SSM.

On the other hand, it should be noted that there exist many other applications based
on many-to-many styled communication, such as multi-party videoconferencing
systems, Distributed Interactive Simulation (DIS), and Internet games. For this type
of interactive application, bi-directional multicast trees such as Core Based Tree
(CBT) [2], Bi-directional PIM [13], and RAMA style Simple Multicast [19], are 
efficient routing schemes for natively delivering data between multiple hosts. 
However, since there is no single point for centralized group access control, sender 
authorization and authentication become new challenges. Typically, if a malicious 
host wants to perform a Denial-of-Service (DoS) attack, it can flood bogus data from
any point of the bi-directional multicast tree. Sender access control for bi-directional
trees based on the IP multicast model is not provided in the specification of any
corresponding routing protocols such as [2, 13]. One possible solution that has been
proposed is to periodically “push” the entire sender access list down to all the on-tree
routers, so that only data from authorized senders can be accepted and sent onto the
bi-directional tree [6]. This simple access control mechanism has been adopted in the
RAMA-style Simple Multicast [19]. However, this policy is not very scalable,
especially when many multicast groups or large groups with many senders are
involved. A more sophisticated scheme named Keyed-HIP (KHIP) [21] works on the
routing level to provide data access control on the bi-directional tree, and flooding
attacks can also be detected and avoided by this network-level security routing
scheme.

In this paper we will propose an efficient and scalable sender access control 
mechanism for bi-directional trees in the IP multicast service model. The basic idea is 
to deploy access policy for external senders on the tree routers where necessary, so 
that data packets from unauthorized senders will be policed and discarded once they hit
the bi-directional tree. Our proposed scheme causes little impact on the current bi- 
directional routing protocols so that it can be directly implemented on the Internet 
without modifying the basic function of the current routing protocols. Moreover, the 
overhead introduced by the new control mechanism is much smaller than that 
proposed in [6] and [19]. 

The rest of the paper is organized as follows: Section 2 gives the overview of our 
proposed dynamic maintenance of the policy. Sections 3 and 4 introduce sender 
authorization and authentication in intra-domain and inter-domain routing. Operations 
on multi-access networks are specially discussed in section 5. We examine the 
scalability issues of our proposed scheme in section 6, and finally we present a 
summary in section 7.

2 Sender Authorization and Authentication Overview

Compared with source specific trees and even uni-directional shared trees such as 
PIM-SM [8], in which external source filtering can be performed at the single source 
or Rendezvous Point (RP) where the registrations of all the senders are processed and 
authorized, in bi-directional trees this is much more difficult since data from any 
source will be directly forwarded to the whole tree once it hits the first on-tree router. 
In fact, since there is no single point for centralized sender access control, information 
source authorization and authentication has to be deployed at the routing level. As we 
have already mentioned, the simplest solution for this is to periodically broadcast the 
entire access control list down to all the routers on the bi-directional tree for deciding 
whether or not to accept data (e.g., [19]). However, this method is only feasible when 
a few small-sized groups with limited number of senders are considered. For large 
scale multicast applications, if we don’t send the whole policy down to all the on-tree 
routers so as to retain the scalability, three questions need to be answered as proposed 
in [6]: (1) How to efficiently distribute the list where necessary? (2) How to find edge 
routers that act as the trust boundary? (3) How to avoid constant lookups for new 
sources? In fact if we try to statically mount the access control policy to an existing 
bi-directional multicast tree, none of the above three questions can be easily 
answered. 

It should be noted that most multicast applications are highly dynamic by nature, 
with frequent joining and leaving of group members and even information senders. Hence the
corresponding control policy should also be dynamically managed. Here we propose 
an efficient sender-initiated distribution mechanism of the access list during the phase 
of multicast tree construction. The key idea is that each on-tree router only adds its 
downstream senders to the local Sender Access Control List (SACL) during their join 
procedure, and the senders in the access list are activated by the notification from the 
core. In fact, only the core has the right to decide whether or not to accept the sources 
and it also maintains the entire SACL for all the authorized senders. Packets coming 
from any unauthorized host (even if it has already been in the tree) will be discarded 
at once when they reach any on-tree router. To achieve this, all senders must first 
register with the core before they can send any data to the group. When a registration 
packet hits an on-tree router, the unicast address of the sender is added into the SACL 
of each router on the way. Under this scenario, the access policy for a particular 
sender is deployed on the branch from the first on-tree router where the registration is 
received along to the core router. Here we define the interface from which this 
registration packet is received as the downstream interface and the one used to deliver 
unicast data to the core as the upstream interface. The format of each SACL entry is 
(G, S, I) where G indicates the group address, S identifies the sender and I is the
downstream interface from which the corresponding registration packet was received. 
If the core has approved the join, it will send a type of “activating packet” back to the 
source, and once each on-tree router receives this packet, it will activate the source in 
its SACL so that it will be able to send data onto the bi-directional tree from then on. 
Under such a scenario, an activated source can only send group data to the tree via the 
path where its SACL entry has been recorded, i.e., even if a sender has been 
authorized, it cannot send data to the group from other branches or elsewhere. Source 
authentication entries kept in each SACL are maintained in soft state for flexibility,
and this requires that information sources periodically send refreshing
packets to the core to keep their states alive in the upstream routers. This action is 
especially necessary when a source is temporarily not sending group data. Once data 
packets have been received from a particular registered sender, the on-tree router may 
assume that this source is still alive and will automatically refresh the state for it. If a 
particular link between the data source and the core fails, the corresponding state will 
time out and become obsolete. In this case the host has to seek an alternative path to
perform re-registration in order to continue sending group data.
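
The life cycle of a SACL entry described above (registration, activation by the core, soft-state refresh and time-out) can be sketched as follows. This is a minimal sketch under stated assumptions: the class and method names are hypothetical, time is passed in explicitly as `now`, and the 60-second lifetime is an arbitrary placeholder since no timer value is specified in the text.

```python
from dataclasses import dataclass
from typing import Dict, Tuple


@dataclass
class SaclEntry:
    interface: str      # downstream interface I of the (G, S, I) entry
    active: bool        # set once the core's activating packet is seen
    expires_at: float   # soft-state deadline


class Sacl:
    def __init__(self, lifetime: float = 60.0) -> None:
        self.lifetime = lifetime  # assumed value, not taken from the paper
        self.entries: Dict[Tuple[str, str], SaclEntry] = {}

    def register(self, group: str, sender: str, interface: str, now: float) -> None:
        """A registration packet passed through: record an inactive entry."""
        self.entries[(group, sender)] = SaclEntry(interface, False, now + self.lifetime)

    def activate(self, group: str, sender: str) -> None:
        """The core's activating packet passed through on its way to the source."""
        entry = self.entries.get((group, sender))
        if entry is not None:
            entry.active = True

    def refresh(self, group: str, sender: str, now: float) -> None:
        """A refresh packet or a data packet from the sender was seen."""
        entry = self.entries.get((group, sender))
        if entry is not None:
            entry.expires_at = now + self.lifetime

    def expire(self, now: float) -> None:
        """Drop entries whose soft state has timed out."""
        stale = [key for key, e in self.entries.items() if e.expires_at <= now]
        for key in stale:
            del self.entries[key]
```

After a time-out the entry is simply gone, so a sender that lost its path has to register again before its data is accepted.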

When a router receives a data packet from one of its downstream interfaces, it will 
first check if there exists such an entry for the data source in its local SACL. If the 
router cannot find a matching entry that contains the unicast address of the source, the 
data packet is discarded. Otherwise if the corresponding entry has been found, the 
router will verify if this packet comes from the same interface as the one recorded in 
the SACL entry. Only if the data packet has passed these two authentication
checks will it be forwarded to the upstream interface and the other interfaces
with the group state, i.e., interfaces where receivers are attached. On the other hand, 
when a data packet comes from the upstream interface, the router will always forward 
it to all the other interfaces with group state and need not perform any authentication. 
Although the router cannot judge if this data packet is from a registered sender, since 
it comes from the upstream router, there exist only two possibilities: either the 
upstream router has the SACL entry for the data source or the upstream router has 
received the packet from its own parent router in the tree. The extreme case is that 
none of the intermediate ancestral routers have such an entry and then we have to 
backtrack to the core. Since the core has recorded entries for all the registered senders 
and it never forwards any unauthenticated packet on its downstream interfaces, we 
can safely conclude that each on-tree router can trust its parent, and hence packets 
received from the upstream interface are always from valid senders. However, this 
scenario excludes the case of routers attached to multi-access networks such as
LANs, and we will discuss the corresponding operations in section 5. 




3 Intra-domain Access Control Policy 



3.1 SACL Construction and Activation 

As we have mentioned, all the information sources must register with the core before 
they can send any data to the bi-directional tree. For each on-tree router, its SACL is 
updated when the registration packet of a new sender is received, and the individual 
entry is activated when its corresponding activating notification is received from the 
core. 

If a host wants to both send and receive group data, it must join the multicast group 
and become a Send-Receive capable member (SR-member, SRM). Otherwise if the
host only wants to send messages to the group without receiving any data, it may 
choose to act as a Send-Only member (SO-member, SOM) or a Non-Member Sender 
(NMS). In the former case, the host must join the bi-directional tree to directly send 
the data, and its designated router will forward the packets on the upstream interface 
as well as other interfaces with the group state. In the IP multicast model information 
sources are allowed to send data to the group without becoming a member. Hence, if 
the host is not interested in the information from the group, it may also choose to act 
as a non-member sender. In this case, the host must encapsulate the data and unicast it 
towards the core. Once the data packet hits the first on-tree router and passes the 
corresponding source authentication, it is decapsulated and forwarded on all the other 
interfaces with the group state. The following description is based on the CBT routing 
protocol, but it can also apply to other bi-directional routing schemes such as Bidir- 
PIM and RAMA-style Simple Multicast. 

(1) SR-member Join

When the Designated Router (DR) receives a group G membership report from a 
SR-member S on the LAN, it will send a join request towards the core. Here we note 
that the group membership report cannot be suppressed by the DR if it is submitted by 
a send-capable member. Once a router receives this join-request packet from one of 
its interfaces, say A, the (G, S, A) entry is added into its SACL. If the router is
not yet on the shared tree, a (*, G) state is created with the interface leading to the
core as the upstream interface and A is set to the downstream interface. At the same 
time, interface A is also added to the interface list with group state so that data from 
other sources can be forwarded to S via A. If the router already has the (*, G) state, 
but A is not in the interface list with group state, then it is added to the list. Thereafter,
the router just forwards the join-request to the core via its upstream interface. Once
the router receives the activating notification from the core, the (G, S, A) entry is 
activated so that S is able to send data. 

(2) SO-member Join

Similar to SR-member joins, the DR of a SO-member also sends a join-request up 
to the core and when the router receives this request from its interface A, the (G, S, A) 
entry is added to the local SACL. If the router is not yet on the tree, (*, G) state will be 
generated but the interface A is not added to the interface list with group state. This is 
because group data need not be forwarded via A to a send-only member.

(3) Non-Member Sender (NMS) registration 

Here we use the terminology “registration” instead of “join request”, since this host 
is not a group member and need not be on the tree to send group data. The registration 
packet from the Non-Member Sender is unicast towards the core and when it hits the 
first router with (*, G) state, the (G, S, A) entry will be created in the local SACL of all 
the on-tree routers on the way leading to the core. It should be noted that if a router is not
on the tree, it does not maintain a SACL for the group.

Finally, if a receive-only member (also known as the group member in 
conventional multicast model) wants to join the group, the join request only invokes a 
(*, G) state if the router is not on the tree, but no new SACL entries need to be created.
Moreover, once the join request hits any on-tree router, a join-notification is 
immediately sent back without informing the core. 
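
The four cases above can be summarized in a small sketch of the per-group state a router updates when a request arrives. This is an illustration under simplifying assumptions, not the CBT message processing itself: it models a single group, and the names `RouterState`, `handle_request` and `activate` as well as the request kinds "SR", "SO", "NMS" and "RO" (receive-only) are hypothetical.

```python
from dataclasses import dataclass, field
from typing import Dict, Set, Tuple


@dataclass
class RouterState:
    on_tree: bool = False  # True once (*, G) state exists
    group_ifs: Set[str] = field(default_factory=set)  # interfaces with group state
    # sender -> (downstream interface, activated by the core?)
    sacl: Dict[str, Tuple[str, bool]] = field(default_factory=dict)


def handle_request(r: RouterState, kind: str, sender: str, in_if: str) -> None:
    """Update router state for a request of the given kind arriving on in_if."""
    if kind not in ("SR", "SO", "NMS", "RO"):
        raise ValueError(f"unknown request kind: {kind}")
    if kind == "NMS":
        # A registration creates no tree state; off-tree routers keep no SACL.
        if r.on_tree:
            r.sacl[sender] = (in_if, False)
        return
    r.on_tree = True  # any join creates (*, G) state if it is missing
    if kind in ("SR", "RO"):
        r.group_ifs.add(in_if)  # receivers need group data forwarded to them
    if kind in ("SR", "SO"):
        r.sacl[sender] = (in_if, False)  # inactive until the core approves


def activate(r: RouterState, sender: str) -> None:
    """The core's activating notification for this sender was received."""
    if sender in r.sacl:
        iface, _ = r.sacl[sender]
        r.sacl[sender] = (iface, True)
```

Forwarding the request towards the core and returning the join-notification are left out; only the local state changes are shown.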

The forwarding behavior of an on-tree router under the sender access control mechanism
is as follows. If group data comes from downstream interfaces, the router will 
authenticate the information source by looking up the local SACL and if the sender 
has its entry in the list and comes from the right interface, the data is forwarded on the 
upstream interface and other interfaces with group state. If the corresponding SACL 
check fails, the data is discarded at once. On the other hand, if the data comes from 
the upstream interface, it is forwarded to all the other interfaces with the group state 
because a router’s parent is always trusted by its children. 



3.2 An Example for Intra-domain Access Policy 

A simple network model is given in Fig. 1. We assume that node A is the core router
and all the Designated Routers (DRs) of potential members of group G should send
join requests to this node. Hosts H1-H5 are attached to the individual routers as shown
in the figure. 

Initially, suppose H1 wants to join the group; its DR (router B) will create (*, G) state
and send the join request to the core A. Since H1 is a SR-member that can both send
and receive data to/from the group, each of the routers that the join request has passed
will add this sender into its local SACL. Hence both router B and A will have the
SACL entry (G, H1, 1), since they both receive the join request from interface 1.

Fig. 1. Intra-domain Network Model. (Legend: Receive Only, Send Only, Send & Receive.)

Host H2 only wants to send messages to group G but does not want to receive any 
data from this group, and so it may choose to join as a SO-member or just act as a 
NMS. In the first case, its DR (router C) will create (*, G) state indicating that this 
router is an on-tree node and then add H2 to its SACL. Thereafter, router C will send a 
join request indicating H2 is a SO-member towards the core; when B receives this 
request, it will also add H2 to its local SACL and then forward the join-request packet 
to A. Since H2 does not want to receive data from the group, link BC becomes a send- 
only branch. To achieve this, router B will not add B3 to the interface list with group 
state. If H2 chooses to act as the Non-Member Sender, router C will not create (*, G) 
state or SACL for the group but send a registration packet towards A. When this 
packet hits an on-tree router, say, B in our example, H2 will be added to the local 
SACL of all the routers on the way. When sending group messages, router C just 
encapsulates the data destined to the core by setting the corresponding IP destination 
address to A. When the data reaches B and passes the SACL authentication, the IP 
destination address is changed to the group address originally contained in the option 
field of the data packet, and the message is forwarded to interfaces B1 and B2 to get 
to H1 and the core respectively. After H3 and H4 join the group, the resulting shared
tree is shown in Fig. 2, and SACLs of each on-tree router are also indicated in the
figure. It should be noted that H4 is a receive-only member, and hence routers E, F
and A need not add it to their local SACLs. Suppose router F has received group data
from H3 on interface F3; it will check in its local SACL if H3 is an authorized sender.
When the data passes the address and interface authentications, it is forwarded to both
interfaces F1 and F2. When group data is received on the upstream interface F1, since
its parent A is a trusted router (in fact the data source should be either H1 or H2), the
data is forwarded to F2 and F3 immediately without any authentication. However, if 
the non-registered host H5 wants to send messages to the group, data won’t be
forwarded to the bi-directional tree due to the SACL authentication failure at router F.

Fig. 2. Bi-directional Tree with SACL. (Recoverable labels: SACL entries (G, H1, 1) and (G, H2, 3); legend: Receive Only, Send Only, Send & Receive.)
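
The behaviour of router F in this example can be reproduced with a short sketch of the forwarding check. The function name `forward` and the data layout are assumptions made for illustration. The interface on which H5's data arrives is not given in the text, so a hypothetical interface F4 is used. Passing `upstream_if=None` models the core, which has no parent and therefore checks every packet.

```python
from typing import Dict, List, Optional, Set, Tuple

# (group, sender) -> (recorded downstream interface, activated by the core?)
SaclTable = Dict[Tuple[str, str], Tuple[str, bool]]


def forward(group: str, source: str, in_if: str, upstream_if: Optional[str],
            group_ifs: Set[str], sacl: SaclTable) -> List[str]:
    """Return the interfaces to forward on; an empty list means discard."""
    if in_if != upstream_if:
        # Data from a downstream interface: address check, then interface check.
        entry = sacl.get((group, source))
        if entry is None:
            return []
        recorded_if, active = entry
        if not active or recorded_if != in_if:
            return []
    # Data from the parent is trusted; forward on all other relevant interfaces.
    outputs = set(group_ifs)
    if upstream_if is not None:
        outputs.add(upstream_if)
    outputs.discard(in_if)
    return sorted(outputs)


# Router F: upstream interface F1, group state on F2 and F3, H3 registered on F3.
sacl_f = {("G", "H3"): ("F3", True)}
print(forward("G", "H3", "F3", "F1", {"F2", "F3"}, sacl_f))  # ['F1', 'F2']
print(forward("G", "H1", "F1", "F1", {"F2", "F3"}, sacl_f))  # ['F2', 'F3']
print(forward("G", "H5", "F4", "F1", {"F2", "F3"}, sacl_f))  # [] (F4 is hypothetical)
```

The interface check also rejects an authorized sender whose data arrives on a branch other than the one where it registered.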



4 Inter-domain Access Control Policy 



4.1 Basic Descriptions 

As we have mentioned above, on-tree routers only maintain the access policy for all 
the downstream senders. However, if large-scale groups with many senders or many 
concurrent sessions are considered, the size of the SACL in the routers near the core 
will become a heavy burden for these on-tree routers. In this section we discuss how 
this situation can be improved with the aid of inter-domain IP multicast routing 
semantics. 

Our key idea is based on hierarchical access control policy to achieve scalability. 
All routers only maintain SACL for the downstream senders in the local domain and 
need not add sources from downstream domains to their local SACLs. In other words, 
all the senders for the group are only authenticated in the local domain. In the root 
domain, the core needs to keep entries only for local senders; however in order to 
retain the function of authorizing and activating information sources from remote 
domains, on receiving their registrations the core router needs to contact a special 
access control server residing in the local domain, which decides whether or not to 
accept the sending requests. 

For each domain, a unique border router (BR) is elected as the “policy agent” and 
keeps the entire SACL for all the senders in the local domain, and we name this BR 
Designated Border Router (DBR) for the domain. In fact the DBR can be regarded as 
core of the sub-tree in the local domain. In this sense, all the data from an upstream 
domain can only be injected into the local domain from the unique DBR and all the 
senders in this domain can only use this DBR to send data up towards the core. This 
mechanism abides by the “3rd party independence” policy in that data from any sender
must be internally delivered to all the local receivers without flowing out of the 
domain. This requires that joins from different hosts (including both senders and 
receivers) merge at a common point inside the domain. In BGP-4, all the BRs of a
stub domain know for which unicast prefix(es) each of them is acting as the egress
router; this satisfies the above requirement of “path convergence” of internal joins.

Since individual sender authentication is performed within each domain and 
invalid data never gets any chance to flow out of the local domain, the on-tree BR of 
the upstream domain will always trust its downstream DBR and assume that all the
data packets coming from it are originated from authorized senders. Hence, when a 
packet leaves its local domain and enters remote domains, no further authentication is 
needed. This also avoids constant lookups when the authenticated data is traveling on 
the bi-directional tree. 



4.2 Inter-domain SACL Construction and Activation 

Since the Border Gateway Multicast Protocol (BGMP [16]) has been considered as the 
long-term solution to inter-domain multicast routing, in this section we will take 
BGMP as an example to illustrate how sender access control policy can be deployed 
in inter-domain applications. 

First we will discuss how the DR for a group member sender submits its join 
request and how it is added to the SACL and activated. This applies to both SR- 
members and SO-members, the only difference between the two being whether or not 
to add the interface from which the join-request was received to the interface list with 
the group state. An on-tree router adds a sender to its SACL only if it receives the 
join request from a sender in the local domain; otherwise the router just forwards 
the join request towards the core without updating its local SACL. 

In Fig. 3, when host S wants to become a SO-member to send data, its DR (router 
A) sends a join request towards the DBR router B, which has the best exit to the root 
domain. All the internal routers receiving this request will add S into their local 
SACLs. Since B is the core of the sub-tree for the local domain, it also needs to create 
a SACL entry for host S once it receives the join request from its Multicast Interior 
Gateway Protocol (M-IGP) component. Thereafter, B finds in its Group Routing 
Information Base (G-RIB) that the best route to the root domain is via its external peer 
C in the transit domain, so router B will send the BGMP join request towards C via its 
BGMP component. Once router C receives the join request, it creates (*, G) state (if it 
is not already on the tree), but will not create an entry for S in its local SACL. When C 
finds out that the best exit toward the root domain is D, it just forwards the join 
request to this internal BGMP peer, and hence router D becomes the DBR of the 
transit domain for group G. If Bidir-PIM is the M-IGP, the RP in this transit 
domain should be placed at D, and router C will use its M-IGP component to send the 
join request towards D. When this join request travels through the transit domain, 
none of the internal routers along the way in the domain will add S into their local 
SACLs. After the join request reaches the root domain and the core router F authorizes 
the new sender by contacting the access control server and sends back the activating- 
notification, all the on-tree routers (including internal on-tree routers and the DBR) in 
the transit domain just forward it back towards the local domain where the new sender 
S is located. When the packet enters the local domain, all the on-tree routers (namely 
B and A in Fig. 3) will activate S in their SACLs. 
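
The walk-through above can be sketched in a few lines of Python. This is only an illustrative model, not part of BGMP or of the proposed scheme's specification: the `Router` class, the `"pending"`/`"active"` state names and the five-router path are assumptions made for the example, and the core's exchange with the access control server is left out.

```python
from dataclasses import dataclass, field

@dataclass
class Router:
    name: str
    domain: str
    # SACL maps a sender to "pending" (join seen) or "active" (authorized).
    sacl: dict = field(default_factory=dict)

def forward_join(path, sender, sender_domain):
    """Carry a join request from the sender's DR (path[0]) to the core (path[-1])."""
    for router in path:
        if router.domain == sender_domain:
            # Only on-tree routers in the sender's own domain record the sender.
            router.sacl[sender] = "pending"
        # Transit and root-domain routers just forward the request.

def forward_activation(path, sender, sender_domain):
    """Carry the activating-notification from the core back to the sender's DR."""
    for router in reversed(path):
        if router.domain == sender_domain and sender in router.sacl:
            router.sacl[sender] = "active"

# Hypothetical path: A and B in the local domain, C and D in the transit
# domain, F the core in the root domain.
path = [Router("A", "local"), Router("B", "local"),
        Router("C", "transit"), Router("D", "transit"), Router("F", "root")]
forward_join(path, "S", "local")
forward_activation(path, "S", "local")
sacls = {r.name: dict(r.sacl) for r in path}
```

After both passes, only A and B hold an entry for S, which is the property that keeps the SACLs of transit routers free of remote senders.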







[Figure 3 legend: Join-request; Activation] 

Fig. 3. Inter-domain Join-Request. 

As we have also mentioned, a send-only host may also choose to act as a Non- 
Member Sender (NMS). However there are some restrictions when inter-domain 
multicast routing is involved. If a send-only host is located in the domain where there 
are no receivers (we call this domain a send-only domain), then the host should join 
the bi-directional tree as a SO-member rather than as a Non-Member Sender (NMS). 
Otherwise, if the host acts as a NMS, its registration packet will not hit any on-tree 
router until it enters remote domains. This forces the on-tree router there to add a 
sender from another domain to its local SACL, which does not conform to the 
rule that on-tree routers only maintain access policy for senders in the local domain. 
On the other hand, if the host joins as a SO-member, its DR will be on the 
bi-directional tree, so the authentication can be achieved by the on-tree routers in the 
local domain. 



4.3 An Example for Inter-domain Access Policy 

An example for inter-domain sender access control is given in Fig. 4. C is the core 
router and domains X, Y and Z are remote domains regarding the core C. Hosts a, b, c 
and d are attached to the routers in different domains. Also suppose that host a only 
wants to receive data from the group, hosts b and c want to both send and receive, 
while host d only wants to send messages to the group without receiving any data 
from it. In this case, X is a receive-only domain and Z is a send-only domain. X1, Y1 
and Z1 are border routers that have been selected as the DBR for each domain. 
According to our inter-domain access control scheme, on-tree routers have SACL 
entries for downstream senders in the local domain, and each DBR has the policy for all 
the senders in the local domain. Hence, Y1 has entries for hosts b and c in its SACL, 
while the SACL of X1 contains no entries at all. Although X is the parent domain of Y 
and Z, which both contain active senders, the on-tree routers in X need not add 
these remote senders to their SACLs. In fact, data coming from Y and Z has already 
been authenticated by their own DBRs (namely Y1 and Z1) before it flows out of the 
local domains. Since host d only wants to send data to the group and there are no 
receivers in domain Z, as we have mentioned, host d should join as a send-only 
member. Otherwise, if d acts as a non-member sender and submits its registration 
packet towards the core, the first on-tree router (X2) has to add d to its SACL, 
which is not scalable because on-tree routers are forced to add senders from 
remote domains. On the other hand, if host d joins as a send-only member, the shared 
tree will span to its DR, namely Z2, and then the authentication can be performed at 
the routers in the local domain. 

As we know, BGMP also provides the mechanism for building source-specific 
branches between border routers. In Fig. 4, we suppose that the current M-IGP is 
PIM-SM. At a certain time, a DR in domain Y, such as Y3 or Y4, may wish to receive 
data from host d in domain Z via the shortest path tree. Hence (S, G) state is 
originated and passed to the border router Y5, which is not the current DBR of domain 
Y. When Y5 receives the source-specific join, it will create (S, G) state and then send 
the corresponding BGMP source-specific join towards Z1. On the other hand, since Z1 
is the DBR of domain Z, intra-domain sender authentication has been performed 
before the traffic is sent to Z1’s BGMP component for delivery to remote domains. In 
fact, Y5 will only receive and accept data originated from host d in domain Z due to its 
(S, G) state filtering. Once Y5 receives the data from host d, it can directly forward it 
to all the receivers in the local domain, since the RPF check can be passed. When the DR 
receives the data from d via the shortest path, it will send a source-specific prune 
message up towards the root domain to avoid data duplication. It should be noted that 
(*, G) state should only exist in the DBR for each domain/group, and internal nodes 
may only receive source-specific traffic via alternative border routers. From this 
example, it is observed that a source-specific tree can also interoperate with the 
proposed sender access control in the receiver’s domain (note that the M-IGP in 
domain Y is not a bi-directional routing protocol). 

Fig. 4. Example for Inter-domain Sender Access Control. 



5 Operations on Multi-access Networks 

We need special consideration for protecting group members from unauthorized 
sources attached to multi-access networks such as LANs. As we have mentioned, if an 
on-tree router receives data packets from its upstream interface, it will always forward 
them to all the other interfaces with group state, since these packets have been 
assumed to come from an authorized information source. However, this may not be 
the case if the upstream interface of an on-tree router is attached to a broadcast 
network. When an unauthorized host sends data with the group address onto the 
multi-access LAN, a corresponding mechanism must be provided to prevent these 
packets from being delivered to all the downstream group members. To achieve this, 
once the Designated Router (DR) on the LAN receives such a packet from its 
downstream interface, if it cannot find a matching access entry for the data source in 
its SACL, it will discard the packet, and at the same time this DR will send a 
“forbidding” control packet containing the unicast address of the unauthorized host to 
the LAN from its downstream interface. Taking the CBT routing protocol as an example, 
the IP destination address of this forbidding packet should be the “all-cbt-routers” address 
(224.0.0.15) and the TTL value is set to 1. Once the downstream router receives 
this packet on its upstream interface, it will stop forwarding data whose source 
address is that of the unregistered host attached to the LAN. Hence all the 
downstream session members will only receive a small amount of unwanted data for a 
short period of time. In terms of implementation, the downstream on-tree routers 
should maintain a “forbidding list” recording the unauthorized hosts. Since all the 
possible unauthorized hosts can only come from the local LAN, this list will not 
introduce much overhead to the routers. In Fig. 5, suppose the unauthorized host S 
sends data to the group. When the DR (router A) cannot find the corresponding entry 
in its local SACL, it immediately discards the packet and then sends a “forbidding” 
packet containing the address of S onto the LAN. Once the downstream router B 
receives the forbidding packet, it will stop forwarding data coming from host S. 
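
The LAN behaviour described above can be sketched as follows. This is a minimal model under stated assumptions: the dictionary layout of the forbidding packet, the function and class names, and the example addresses are all hypothetical; only the destination address 224.0.0.15 and the TTL of 1 come from the text.

```python
ALL_CBT_ROUTERS = "224.0.0.15"  # destination used in the CBT example in the text

def dr_handle_lan_packet(sacl, source):
    """DR side: forward data from a registered source, otherwise discard it
    and build a forbidding packet naming the offending host."""
    if source in sacl:
        return "forward", None
    forbidding = {"dst": ALL_CBT_ROUTERS, "ttl": 1, "forbidden_host": source}
    return "discard", forbidding

class DownstreamRouter:
    """Downstream on-tree router attached to the same LAN."""
    def __init__(self):
        self.forbidding_list = set()

    def on_forbidding_packet(self, packet):
        # Remember the unauthorized host announced by the DR.
        self.forbidding_list.add(packet["forbidden_host"])

    def forwards(self, source):
        # Data from a forbidden host is no longer passed downstream.
        return source not in self.forbidding_list

router_b = DownstreamRouter()
action, packet = dr_handle_lan_packet(sacl={"10.0.0.5"}, source="10.0.0.9")
if packet is not None:
    router_b.on_forbidding_packet(packet)
```

Because the forbidding list can only contain hosts on the local LAN, a plain set is enough for this sketch.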



[Figure 5 labels: Downstream Domain(s); Core] 

Fig. 5. Access Control Operation on LANs. 

In inter-domain routing, further consideration is necessary for data traffic traveling 
towards the core. This is because routers in transit domains do not have SACL entries 
for remote senders. Again taking Fig. 5 as an example, suppose that the 
LAN is located in a transit domain where there are no local authorized senders, and 
hence router A’s SACL is empty. If there is data appearing on the LAN destined to the 
group address, there are only two possibilities: (1) the data came from a downstream 
domain and was forwarded to the LAN by router B; (2) a local unregistered host 
attached to the LAN (e.g., host S) sent the data. It is obvious that in the former case 
router A should pick up the packet and forward it towards the core, and in the latter 
it should just discard the packet and send the corresponding “forbidding” packet onto 
the LAN. Hence this requires that the router be able to distinguish between packets 
coming from remote domains and packets coming from directly attached hosts on the 
LAN. However, this is easy to achieve by simply checking the source address prefix. 
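
A source-prefix check of this kind could look like the sketch below, which uses Python's standard `ipaddress` module. The function name, the return labels and the example prefix are assumptions made for illustration; the text does not prescribe how the check is implemented.

```python
import ipaddress

def classify_lan_packet(source, lan_prefix, sacl):
    """Decide what the DR does with a group-addressed packet seen on the LAN."""
    addr = ipaddress.ip_address(source)
    lan = ipaddress.ip_network(lan_prefix)
    if addr in lan:
        # Directly attached host: it must be registered in the local SACL.
        return "forward" if source in sacl else "discard_and_forbid"
    # Source lies outside the LAN prefix: the packet was relayed from a
    # downstream domain, whose DBR is trusted to have authenticated it.
    return "forward"

# Transit-domain case from the text: router A's SACL is empty.
local_result = classify_lan_packet("10.1.1.7", "10.1.1.0/24", sacl=set())
remote_result = classify_lan_packet("192.0.2.9", "10.1.1.0/24", sacl=set())
```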



6 SACL Scalability Analysis 



In this section we discuss scalability issues regarding router memory consumption. 
For simplicity we only discuss the situation of intra-domain routing here. 
Nevertheless, when inter-domain hierarchical sender access control is involved, the 
situation can be improved still further. It is obvious that the maximum memory space 
needed in maintaining a SACL is O(ks), where k is the number of multicast groups and 
s is the number of senders in the group. Typically this is exactly the size of SACL in 
the core router. However, since on-tree routers need not keep the access policy for all 
sources but only for downstream senders, the average size of SACL in each on-tree 
router is significantly smaller. 

We can regard the bi-directional shared tree as a hierarchical structure with the 
core at the top level, i.e., level 0. Since each of the on-tree routers adds its 
downstream senders to its local SACL, then the SACL size S of router i in the shared 
tree T can be expressed as follows: 



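$$S(i) = \sum_{j} S(j) \qquad (1)$$

and the average SACL size per on-tree router is: 

$$\bar{S} = \frac{\sum_{i=0}^{H} \sum_{j=1}^{L_i} Y_j \, S(j)}{\sum_{i=0}^{H} \sum_{j=1}^{L_i} Y_j} \qquad (2)$$

where H is the number of hops from the farthest on-tree router (or maximum level) 
and $L_i$ is the number of routers on level i, while 

$$Y_j = \begin{cases} 1 & \text{if router } j \text{ is included in the shared tree} \\ 0 & \text{otherwise} \end{cases} \qquad (3)$$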
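
To make the averaging concrete, the sketch below computes per-router SACL sizes and their mean on a small tree. It assumes that a router's SACL holds one entry per sender located at or below it on the shared tree; the tree shape, the router names and the sender counts are invented for the example.

```python
def sacl_sizes(children, attached, root):
    """Return {router: SACL size}, counting every sender at or below the router."""
    sizes = {}

    def visit(router):
        total = attached.get(router, 0)          # senders whose DR is this router
        for child in children.get(router, []):
            total += visit(child)                # plus all senders further downstream
        sizes[router] = total
        return total

    visit(root)
    return sizes

def average_sacl_size(children, attached, root):
    sizes = sacl_sizes(children, attached, root)
    return sum(sizes.values()) / len(sizes)      # mean over on-tree routers only

# Hypothetical shared tree: core -> r1 -> r3, core -> r2.
children = {"core": ["r1", "r2"], "r1": ["r3"]}
attached = {"r3": 2, "r2": 1}                    # three senders in total
sizes = sacl_sizes(children, attached, "core")
average = average_sacl_size(children, attached, "core")
```

Here the core keeps all three senders while the average over the four on-tree routers is 2.0; under full policy maintenance every router would hold three entries.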

To ensure that the scalability issues are fairly evaluated throughout our simulation, 
random graphs with low average degrees, which represent the topologies of common 
point-to-point networks, e.g., NSFNET, are constructed. Here we adopt the commonly 
used Waxman’s random graph generation algorithm [22] that has been implemented 
in GT-ITM, for constructing our network models. For simplicity, we only consider 
intra-domain routing scenarios in our simulation. 

First we study the relationship between average SACL size and total number of 
senders. In the simulation we generate a random network with 100 routers with the 
core router also being randomly selected. The number of senders varies from 10 to 50 
in steps of 10 while the group size is fixed at 50. Here we study three typical 
situations regarding the type of sending hosts: 

(1) All senders are also receivers (AM); 

(2) 50% of the senders are also receivers (HM); 

(3) None of the senders are receivers (NM). 

All send-only hosts choose to act as Non-Member Senders (NMS). 




Fig. 6. SACL size vs. Number of Senders (I). 

From Fig. 6 we can see that the average SACL size grows as the number of senders 
increases. However, it can be observed that even when the number of senders reaches 
50, the average SACL size is still very small (fewer than 4 entries on 
average). This is in significant contrast with the strategy of “full policy maintenance” 
(FPM) on each router [6, 19]. Further comparison between the two methods is 
presented in Table 1. From the figure we can also see that if all the senders are also 
receivers on the bi-directional tree (case AM), this results in a larger average SACL 
size. On the other hand, if none of the senders is a receiver (case NM), the 
corresponding SACL size is smaller. This phenomenon is expected because, given a 
fixed number of receivers on the bi-directional tree and a fixed sender group, the 
larger the proportion of senders coming from the receiver set, the larger the resulting 
average SACL size. However, this gap decreases with larger sender group size. 




Fig. 7. SACL size vs. Number of Senders (II). 

Next we study the effect on SACL size resulting from the senders’ choice of acting 
as a Send-Only Member (SOM) or a Non-Member Sender (NMS). As we have 
mentioned, a host only wishing to send data to the group can decide to act as a SOM 
or NMS. Fig. 7 illustrates the relationship between the SACL size and the total number of 
senders. The group size is fixed at 50 and the number of senders varies from 5 to 40 in 
steps of 5. It should be noted that in this simulation all group members are receive-only 
hosts and do not send any data to the group. From the figure we can see that the 
SACL size also grows with the number of senders. Moreover, if all the 
hosts join the bi-directional tree and act as Send-Only Members (SOM), the average 
SACL size is smaller. The reason for this is as follows: if the hosts choose to take the 
role of SOM, the bi-directional tree expands to include the DRs of 
these senders. Since the number of on-tree routers grows while the total number of 
senders remains the same, the resulting average SACL size becomes smaller. On 
the other hand, if all of the hosts just act as Non-Member Senders, the shape of the 
shared tree will not change and no more on-tree routers are involved. 




Fig. 8. Average SACL Size vs. Group Size. 

We continue to study the relationship between the average SACL size and the 
group size (number of receivers) with number of senders fixed at 20. We still let these 
senders choose to act as a SOM or NMS respectively. From Fig. 8 we can see that the 
SACL size decreases with the growth of the group size in both cases. On the other 
hand, joining as a SOM results in a smaller average SACL size compared with NMS. The gap is 
more significant when there are fewer receivers. This is because if senders choose to 
act as SOM, they have to join the tree and generate many send-only branches, i.e., 
more routers are involved in the bi-directional tree. If the hosts just send data without 
becoming group members, the shared tree won’t span to any of these senders, so that 
the number of on-tree routers is independent of the number of senders. When the 
group size is small (e.g., 5 receivers), the size of the bi-directional tree will be 
increased significantly to include all the senders if they join as SOMs. This explains 
why the gap is more obvious when a small set of receivers is involved. 

Table 1. Comparison with FPM. 

        0.65    1.27    1.82    2.3 
NMS     0.73    1.4     2.09    2.73 

Finally we give the comparison between our method and the “full policy 
maintenance” (FPM) strategy regarding router memory consumption. Table 1 gives 
the relationship between SACL size and total number of senders (S). From the table we can 
see that the length of the access list recorded in each on-tree router in the FPM 
mechanism is exactly the number of active senders. This imposes a very large overhead 
on routers compared with our proposed scheme. Although the core router also has to 
maintain the full access list in our method when intra-domain routing is considered, 
the situation could be improved in large-scale multicast applications by hierarchical 
control in inter-domain routing which we introduced in section 4. 



7 Summary 

In this paper we propose an efficient mechanism of sender access control for bi- 
directional multicast trees in the IP multicast service model. Each on-tree router 
dynamically maintains access policy for its downstream senders. Under such type of 
control, data packets from unauthorized hosts are discarded once they hit any on-tree 
router. In this sense, group members won’t receive any irrelevant data, and network 
service availability is guaranteed since the multicast tree is protected from denial-of- 
service attacks such as data flooding from any malicious host. In order to achieve 
scalability for large-scale multicast applications with many information sources and to 
accommodate more concurrent multicast sessions, we also extend our control 
mechanism to inter-domain routing where hierarchical access policy is maintained on 
the bi-directional tree. Simulation results also show that the memory overhead of our 
scheme is quite light so that good scalability can be achieved. 

Nevertheless, this paper only provides a general paradigm of sender access control, 
but does not present a solution to the restriction of sources based on the specific 
interest from individual receivers. Related works include [12], [17] and [18], and this 
will be one of our future research directions. 

Abstract. Several protocols have been proposed to deal with the group 
key management problem. The most promising are those based on 
hierarchical binary trees. A hierarchical binary tree of keys reduces the 
size of the rekey messages, reducing also the storage and processing 
requirements. In this paper, we describe a new efficient hierarchical binary 
tree (EHBT) protocol. Using EHBT, a group manager can use keys 
already in the tree to derive new keys. Using previously known keys saves 
information to be transmitted to members when a membership change 
occurs and new keys have to be created or updated. EHBT can achieve 
(I · log2 n) message size (I is the size of a key index) for join operations 
and (K · log2 n) message size (K is the size of a key) for leave operations. 
We also show that the EHBT protocol does not increase the storage and 
processing requirements when compared to other HBT schemes. 



1 Introduction 

With IP multicast communication, a group message is transmitted to all mem- 
bers of the group. Efficiency is clearly achieved as only one transmission is needed 
to reach all members. The problems start because any machine can join a mul- 
ticast group and start receiving the messages sent to the group without the 
sender’s knowledge. This characteristic raises concerns about privacy and secu- 
rity since not every sender wants to allow everyone to have access to its commu- 
nication. 

Cryptographic tools can be used to protect group communication. An en- 
cryption algorithm takes input data (e.g. a group message) and performs some 
transformations on it using a key (where the key is a randomly generated num- 
ber). This process generates a ciphered message. There is no easy way to recover 
the original message from the ciphered text other than by knowing the key jO] • 
When applying such technique, it is possible to run secure multicast sessions. 
Group messages are protected by encryption using a chosen key {group key). 
Only those who know the group key are able to recover the original message. 
However, distributing the group key to valid members is a complex problem. 
Although rekeying a group before the join of a new member is trivial (send the 
new group key to the old group members encrypted with the old group key), 

* The work presented here was done within the context of ShopAware - a research 
project funded by the European Union in the Framework V IST Programme. 
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rekeying the group after a member leaves is far more complicated. The old key 
cannot be used to distribute a new one, because the leaving member knows the 
old key. A group manager must, therefore, provide other scalable mechanisms to 
rekey the group. 

Several researchers have studied the use of a hierarchical binary tree (HBT) 
for the group key management problem. Using an HBT, the key distribution 
centre (KDC) maintains a tree of keys, where the internal nodes of the tree hold 
key encryption keys (KEKs) and the leaves correspond to group members. Each 
leaf holds a KEK associated with that one member. Each member receives and 
maintains a copy of the KEK associated with its leaf and the KEKs corresponding 
to each ancestor node in the path from its parent node to the root. All group 
members share the key held by the root of the tree. For a balanced tree, each member 
stores log2 n + 1 keys, where n is the number of members. This hierarchy is 
exploited to achieve better performance when updating keys. 
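
As a small illustration of the log2 n + 1 figure, the sketch below lists the nodes whose keys one member holds in a balanced tree. The heap-style numbering (root is node 1, leaves are nodes n to 2n - 1) is an assumption chosen for the example and is not the indexing used by any of the protocols discussed here.

```python
import math

def key_path(leaf_position, n):
    """Nodes whose keys a member holds: its leaf and every ancestor up to the root."""
    if n <= 0 or n & (n - 1):
        raise ValueError("this sketch assumes n is a power of two")
    node = n + leaf_position      # heap numbering: leaves occupy n .. 2n - 1
    path = []
    while node >= 1:
        path.append(node)
        node //= 2                # move to the parent
    return path

path = key_path(leaf_position=2, n=8)
keys_held = len(path)
expected = int(math.log2(8)) + 1
```

For n = 8 the member at leaf position 2 holds the keys of nodes 10, 5, 2 and 1, that is, four keys.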

In this paper, we propose a protocol to efficiently build an HBT, which we 
call the EHBT protocol. The EHBT protocol achieves (I · log2 n) message size for 
addition operations and (K · log2 n) message size for removal operations, keeping 
the storage and processing on both client and server sides to a minimum. We 
achieve these bounds using well-known techniques, such as a one-way function 
and the xor operator. 



2 Related Work 

Wallner et al. [13] were the first to propose the use of an HBT. In their approach, 
every time the group membership changes, internal node keys (affected by the 
membership change) are updated and every new key is encrypted with each of 
its children’s keys and then multicast. A rekey message conveys 2 • log2 n keys 
for including or removing a member. 

Caronni et al. [12] proposed a very similar protocol to that of Wallner, but 
they achieve a better performance regarding the size of multicast messages for 
joining operations. We refer to this protocol as HBT+. Instead of encrypting 
new key values with their respective children’s key, Caronni proposes to pass 
those keys into a one-way function. Only the indexes of the refreshed keys need 
to be multicast and an index size is smaller than the key size. 

An improvement to the hierarchical binary tree approach is the one-way 
function tree (OFT) proposed by McGrew and Sherman j^j. The keys of a node’s 
children are blinded using a one-way function and then mixed together using 
the xor operator. The result of this mixing is the KEK held by the node. The 
improvement is due to the fact that when the key of a node changes, its blinded 
version is only encrypted with the key of its sibling node. Thus, the rekey message 
carries just log2 n keys. 
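
The blinding-and-mixing step can be sketched as follows. SHA-256 stands in for the one-way function purely as an assumption for the example; the function names are hypothetical and the sketch follows only the description given in the paragraph above.

```python
import hashlib

def blind(key: bytes) -> bytes:
    """Blind a key with a one-way function (SHA-256 is an assumed stand-in)."""
    return hashlib.sha256(key).digest()

def xor_bytes(a: bytes, b: bytes) -> bytes:
    return bytes(x ^ y for x, y in zip(a, b))

def oft_node_key(left_key: bytes, right_key: bytes) -> bytes:
    """KEK of a node: xor of the blinded keys of its two children."""
    return xor_bytes(blind(left_key), blind(right_key))

left = b"L" * 32
right = b"R" * 32
kek = oft_node_key(left, right)
# A member under the left child knows `left` and only needs blind(right),
# which is why a changed key is sent in blinded form to the sibling side only.
kek_at_left_member = xor_bytes(blind(left), blind(right))
```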

Canetti et al. [3] proposed a slightly different approach that achieves the same 
communication overhead. Their scheme uses a pseudo-random-generator (PRG) 
P] to generate the new KEKs rather than a one-way function and it is applied 
only on user removal. 
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Perrig et al proposed the efficient large-group key (ELK) protocol j^l- The 
ELK protocol is very similar to the OFT, but ELK uses pseudo-random func- 
tions (PRFs)G to build and manipulate the keys in the tree. ELK employs a 
timely rekey, hence, at every time interval, the KDC refreshes the root key using 
the PRF function and then uses it to update the whole key tree. By deriving all 
keys, ELK does not require any multicast messages during a join operation. All 
members can refresh their own keys, hence no rekey message is required. When 
members are deleted, as in OFT, new keys are generated from both its children’s 
keys. 



3 Efficient Hierarchical Binary Tree Protocol 

In the EHBT protocol, a KDC maintains a tree of keys. The internal nodes of the 
tree hold KEKs and the leaves correspond to group members. Keys are indexed 
by randomly chosen numbers. Each leaf holds a secret key that is associated to 
that member. The root of the tree holds a common key to all members. 

Ancestors of a node are those nodes in the path from its parent node to the 
root. The set of ancestors of a node is called its ancestor set. Each member knows 
only its own key (associated with its leaf node) and the keys corresponding to each 
node in its ancestor set. For a balanced tree, each member stores log2 n + 1 keys, 
where n is the number of members. 

In order to guarantee backward and forward secrecy the keys related to 
joining members or leaving members should be changed every time the group 
membership changes. The new keys in the ancestor set of an affected leaf are 
generated upwards from the key held by the affected leaf’s sibling up to the root. 
Using keys that are already in the tree can save information to be transmitted 
to members when a membership change occurs and new keys have to be created or 
updated. 

The formula T(x, y) = h(x ⊕ y) is used to generate keys from other keys, 
where h is a one-way hash function and ⊕ is the normal xor operator. The obvious 
functionality of function h is to hide the original values of x and y in a value z in 
such a way that someone who knows only z cannot find the original values x and y. The 
functionality of ⊕ is to mix x and y and generate a new value. 

We say that a key ki can be refreshed by doing ki' = T(ki, i), where i is 
the index (or identifier) of key ki, or that key ki can be updated by deriving it from one of 
its children's keys by doing ki' = T(kj, i), where j is 
either the left or right child. Appendix E describes the reason for using index i in 
function T. 
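
A minimal Python sketch of the key-derivation function and of the refresh and update operations is given below. The choice of SHA-256 for h, the 32-byte key length and the big-endian encoding of the index are assumptions made for the example; the text does not fix any of them.

```python
import hashlib

KEY_BYTES = 32  # assumed key length

def T(x: bytes, y: bytes) -> bytes:
    """T(x, y) = h(x xor y), with SHA-256 standing in for h."""
    if len(x) != len(y):
        raise ValueError("inputs must have equal length")
    mixed = bytes(a ^ b for a, b in zip(x, y))
    return hashlib.sha256(mixed).digest()

def encode_index(i: int) -> bytes:
    """Pad a key index to the key length so it can be xor-ed with a key."""
    return i.to_bytes(KEY_BYTES, "big")

def refresh(k_i: bytes, i: int) -> bytes:
    """Refresh: new value of key i derived from its own old value."""
    return T(k_i, encode_index(i))

def update_from_child(k_j: bytes, i: int) -> bytes:
    """Update: new value of key i derived from the key of child j."""
    return T(k_j, encode_index(i))

k14 = bytes(range(KEY_BYTES))
k14_new = refresh(k14, 14)
```

Both operations call the same function; they differ only in which key is fed in, which is what lets a holder of the input key compute the new value from the index alone.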



3.1 Rekey Message Format 

A member can receive two types of information in a rekey message, one telling 
him to refresh or update the value of a key, the other telling him the new value 



^ ELK uses the stream cipher RC5 0 as the PRF. 
of a key. In the former case, the member receives an id and in the latter case, 
he receives a key value. After deriving a key, a member will try to derive all 
other keys by himself (from that key up to the root) unless he receives other 
information telling him something different. For example, if key Ki is refreshed, 
the KDC needs to send to Ki’s holders the identification of the key so that they 
can perform the refresh operation themselves. Or, if a node n has its key updated 
(kn' = T(kL, n)), then it implies sending to member L the index n and to the 
other child, namely R, the new key value (because R does not know L’s key). 



[Figure 1: a rekey message in two parts, “indexes and commands” followed by “keys”; key entries have the form {Kj}Ki] 

Fig. 1. Example of a Rekey Message. 



The rekey message that relays this information has two parts. The first part 
carries commands and the second carries keys. Each piece of information is 
indexed by a key index. Keys are encrypted with the key indicated by the key 
index (see Figure 1), but commands are not encrypted because they do not carry 
vital information. Based on commands and keys, members can find out which 
keys they must refresh or update, or just substitute, because they have received 
a new key value to a specific key. 



Algorithm 1: Reading Rekey Message Algorithm. 

(1) receive rekey message 
(2) set last command to “keep key” 
(3) while there is a key to be derived 
(4)     get a key index from key-list 
(5)     search indexes part of rekey message for key index 
(6)     if there is a command 
(7)         execute the command on the specific key 
(8)         set last command to this command 
(9)     else 
(10)        search keys part of the rekey message for key index 
(11)        if there is a key 
(12)            substitute it in the key list 
(13)            set last command to “update” 
(14)    if there is no command or key 
(15)        execute last command in current key 



The algorithm to handle rekey messages starts with a member holding a list of 
known keys (key-list). After executing the algorithm, a member will have all his 
keys freshened up. A simplified version of this algorithm appears in Algorithm 1. 
In the remainder of this paper, we use the following notation: 

+i or −i or *i       are commands to be applied on key i 
R(ki)                refresh ki applying T(ki, i) 
U(ki)                update kj applying T(ki, j) 
{x}k                 encryption of x with k 
j : command          command to key j’s holder 
[commands, keys]     message containing commands and keys 



4 Basic Operations 

In this section, we describe the basic algorithms for join and leave operations for 
single and multiple cases. 





Fig. 2. User u2 Joins the Tree. 

Fig. 3. Users u2 and u3 Join the Tree. 



Single Member Join Algorithm. When a member joins the group, it is 
associated to a leaf node n. The KDC assigns a randomly chosen key to n. 
Leaf n is then included in the tree at the parent of the shallowest leaf node s 
(to keep the tree as short as possible). Leaf s is removed from the tree, and in 
its place a new node p is inserted. Leaves s and n are inserted as p's children. 
We see an example in Figure |3 Member 2 is placed in leaf ri2, which is inserted 
at node ni2- Node ni2 becomes the new parent of leaves ni and U2. Leaf U2 is 
assigned key k2- 

In order to keep the backward secrecy, keys in ni’s ancestor set need to 
receive new values. Key ki is refreshed {k[ = TZ{ki)), K12 receives a value based 
on k[ {Ki 2 = U{k[)) and keys K14 and Kis are refreshed {K'14 — TZ{Ki4) and 
A^8 = 7^(Al8)). ^ 

Note that during a join operation, keys, which were already in the tree, are 
just refreshed. Members holding those keys only need to be told those keys’ 
indexes to be able to generate their new values, which means that these keys do 
not have to be transmitted. In the same way, members that had their keys used 
for generating new keys just have to be told the index of the new key and they 
can generate that key by themselves. 

The KDC generates unicast messages for member $u_2$ ($[k_2, K_{12}, K_{14}', K_{18}']$)
and member $u_1$ ($[+12]$), and multicast message $[14 : *14, 18 : *18]$.

Member $u_2$ receives its unicast message and creates its key-list. Member $u_1$
receives its unicast message and derives key $K_{12}$, including it in its key-list.
Members holding keys $K_{14}$ and $K_{18}$ refresh these keys.
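The derivations in this join example can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the authors' implementation: it assumes $\mathcal{F}(k, i) = h(k \oplus i)$, uses SHA-256 truncated to 16 bytes as a stand-in for $h$, and the names `F`, `refresh`, `update` and `join_u2` are hypothetical.

```python
import hashlib

KEY_BYTES = 16  # assumed key length for this sketch


def F(key: bytes, index: int) -> bytes:
    """Assumed form of F: hash of (key XOR index)."""
    mixed = int.from_bytes(key, "big") ^ index
    return hashlib.sha256(mixed.to_bytes(KEY_BYTES, "big")).digest()[:KEY_BYTES]


def refresh(key: bytes, index: int) -> bytes:
    """R: new value of key `index`, derived from its own old value."""
    return F(key, index)


def update(child_key: bytes, parent_index: int) -> bytes:
    """U: value of key `parent_index`, derived from a child's key."""
    return F(child_key, parent_index)


def join_u2(k1: bytes, K14: bytes, K18: bytes) -> dict:
    """Keys that change when u2 joins next to u1 (the single-join example)."""
    k1_new = refresh(k1, 1)
    return {
        1: k1_new,                # k1'  = R(k1)
        12: update(k1_new, 12),   # K12  = U(k1'), key of the new parent node
        14: refresh(K14, 14),     # K14' = R(K14)
        18: refresh(K18, 18),     # K18' = R(K18)
    }
```

Because every new value is a deterministic function of a key the holder already has, the sibling only needs the index 12 and the other members only need the indexes 14 and 18 to arrive at the same values as the KDC.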



Multiple Members Join Algorithm. Several new members are inserted in 
the tree as in the single member join algorithm. They are associated to nodes 
and the nodes are placed at the parent of the shallowest leaves. However, the 
keys in the tree are modified in a slightly different manner. New nodes’ ancestor 
sets converge at some point and all keys that are in more than one ancestor set 
are modified only once. 

See Figure 3 for an example. Members $u_2$ and $u_3$ joined the group and have
been placed at nodes $n_{12}$ and $n_{43}$, respectively. Following the single
member join algorithm, the keys in member $u_2$'s ancestor set are changed: first,
$k_1' = \mathcal{R}(k_1)$, and then, $K_{12} = \mathcal{U}(k_1')$,
$K_{14}' = \mathcal{R}(K_{14})$, $K_{18}' = \mathcal{R}(K_{18})$. In the same way,
keys in member $u_3$'s ancestor set are changed: first, $k_4' = \mathcal{R}(k_4)$,
and then, $K_{43} = \mathcal{U}(k_4')$. Keys $K_{14}$ and $K_{18}$ have already
been changed because of member $u_2$, hence they are not changed again.

The KDC generates unicast messages for member $u_2$ ($[k_2, K_{12}, K_{14}', K_{18}']$),
member $u_3$ ($[k_3, K_{43}, K_{14}', K_{18}']$), member $u_1$ ($[+12]$) and member
$u_4$ ($[+43]$), and multicast message $[14 : *14, 18 : *18]$.

Members $u_2$ and $u_3$ receive their unicast messages and create their respective
key-lists. Member $u_1$ receives the unicast message, derives key $K_{12}$, and
includes it in its key-list. Member $u_4$ does the same with key $K_{43}$. Members
holding keys $K_{14}$ and $K_{18}$ refresh these keys.
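The rule that a key appearing in more than one ancestor set is changed only once can be sketched as follows. The parent map below is an assumption reconstructed from the node names in the example, and the helper names are hypothetical.

```python
# Assumed parent links for the multiple-join example (child -> parent).
PARENT = {
    "n1": "n12", "n2": "n12",
    "n4": "n43", "n3": "n43",
    "n12": "n14", "n43": "n14",
    "n14": "n18",
}


def ancestors(node, parent):
    """Nodes on the path from `node` (exclusive) up to the root."""
    path = []
    while node in parent:
        node = parent[node]
        path.append(node)
    return path


def keys_to_change(new_leaves, parent):
    """Union of the new leaves' ancestor sets, each node listed once."""
    seen = []
    for leaf in new_leaves:
        for node in ancestors(leaf, parent):
            if node not in seen:
                seen.append(node)
    return seen
```

For the two joining leaves, `keys_to_change(["n2", "n3"], PARENT)` lists `n14` and `n18` once each, even though both ancestor sets contain them.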




Single Member Leave Algorithm. When a member u leaves or is removed
from the group, its sibling s replaces its parent p. Moreover, all keys known by
u should be updated to guarantee forward secrecy. For example, see Figure 4:
$u_2$ leaves (or is removed from) the group and its node is removed from the tree.
Node $n_{12}$ is also removed and leaf $n_1$ is promoted to its place.

In order to keep the forward secrecy, keys in $n_1$'s ancestor set need to receive
new values. Keys $K_{14}$ and $K_{18}$ have to be updated:
$k_1' = \mathcal{R}(k_1)$, $K_{14}' = \mathcal{U}(k_1')$ and
$K_{18}' = \mathcal{U}(K_{14}')$.

Note that in removal operations, all keys in the removed member’s ancestor 
set are updated. Those keys cannot be just refreshed because the removed mem- 
ber knows their previous values and could easily calculate the new values. Since 
the new values are all generated from the removed member’s sibling key, which 
was not known by the removed member, the removed member cannot find the 
new values. 
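A small sketch makes the difference between refreshing and updating concrete. As before, this assumes $\mathcal{F}(k, i) = h(k \oplus i)$ with truncated SHA-256 standing in for $h$; `leave_chain` is a hypothetical helper, not part of the protocol specification.

```python
import hashlib


def F(key: bytes, index: int) -> bytes:
    """Assumed form of F: hash of (key XOR index), truncated to 16 bytes."""
    mixed = int.from_bytes(key, "big") ^ index
    return hashlib.sha256(mixed.to_bytes(16, "big")).digest()[:16]


def leave_chain(sibling_key: bytes, sibling_index: int, ancestor_indices):
    """Refresh the sibling's key, then update each ancestor from the key below it."""
    below = F(sibling_key, sibling_index)        # k_s' = R(k_s)
    new_keys = {sibling_index: below}
    for idx in ancestor_indices:                 # bottom-up towards the root
        below = F(below, idx)                    # K_idx' = U(key below)
        new_keys[idx] = below
    return new_keys
```

With `leave_chain(k1, 1, [14, 18])` the new $K_{14}'$ depends only on $k_1$, which the removed member never held. A plain refresh, `F(old_K14, 14)`, would instead be computable by anyone who knew the old $K_{14}$, including the removed member.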

The KDC generates multicast message [1 : —12, 

Member $u_1$ refreshes $k_1$ and, because it has removed $K_{12}$, it updates
$K_{14}$ and $K_{18}$. Members holding key $K_{34}$ get new key $K_{14}'$ and then
update key $K_{18}$. Members holding key $K_{58}$ get new key $K_{18}'$.



Multiple Members Leave Algorithm. This algorithm is handled similarly
to the single member leave algorithm. The leaving nodes are removed and the
tree shape is adjusted accordingly. As in the multiple join algorithm, there can
be several different paths from removed nodes to the root, which means that the
root key can be updated by several nodes (see Figure 5).

In order to avoid several root key versions for the same operation, the KDC
chooses one of the paths and uses it to update the root key. For example, in
Figure 5, $n_2$ and $n_6$ leave the group and nodes $n_1$ and $n_5$ are promoted
to their respective parents' places ($n_{12}$ and $n_{56}$). Both are used to
derive their new parent keys $K_{14}'$ and $K_{58}'$, but they cannot both be used
to update key $K_{18}$. In this case, the KDC chooses one of them to update key
$K_{18}$ and the other will receive the updated key. For instance, the KDC chooses
node $n_1$ and then the keys are updated as follows: $k_1' = \mathcal{R}(k_1)$,
$K_{14}' = \mathcal{U}(k_1')$, $K_{18}' = \mathcal{U}(K_{14}')$,
$k_5' = \mathcal{R}(k_5)$ and $K_{58}' = \mathcal{U}(k_5')$.

The KDC generates multicast message [1 : — 12,5 : — 56 , ,{K5s}k > 

5 o 

Member $u_1$ refreshes $k_1$ and, because it has removed $K_{12}$, it updates
$K_{14}$ and $K_{18}$. Key $K_{34}$ holders recover $K_{14}'$ and update $K_{18}$.
Member $u_5$ refreshes $k_5$ and updates $K_{58}$, but since there is a new key
encrypted with $K_{58}'$, $u_5$ stops updating its keys and just recovers $K_{18}'$.
Key $K_{78}$'s holders recover $K_{58}'$ and, since there is a key encrypted with
it, they just recover $K_{18}'$.



Rebalancing. The efficiency of the key tree depends crucially on whether the
tree remains balanced or not. A tree is said to be balanced if no leaf is much
further away from the root than any other leaf. In general, for a balanced binary
tree with n leaves, the distance from the root to any leaf is $\log_2 n$, but if
the tree is unbalanced, the distance from the root to a leaf can become as high
as n. Therefore, it is desirable to keep a key tree as balanced as possible.

The rebalancing works by finding the shallowest and deepest internal nodes
and comparing their depths. If the depth gap is larger than two, the tree is
unbalanced and needs to be levelled. To balance the tree, the deepest leaf node
is removed, which moves its sibling one level up (similarly to the removing
algorithm), and is then inserted at the shallowest node (similarly to the
inserting algorithm). This procedure is repeated until the difference between the
depths of the shallowest and the deepest nodes is smaller than two.

In a rebalancing operation, the deepest node, which has been moved from
one position in the tree to another, requires that its old keys be updated
(as in a deletion operation), and it needs to have access to the keys in its new
path to the root (as in an insertion operation). Therefore, an insertion and a
deletion are performed simultaneously.
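The effect of the rebalancing loop on the shape of the tree can be sketched by tracking leaf depths only. This is a simplification introduced here for illustration: it ignores keys and node identities, it compares leaf depths rather than internal-node depths, and it uses the stopping rule "depth difference smaller than two". The function name `rebalance` is hypothetical.

```python
def rebalance(leaf_depths):
    """Move the deepest leaf next to the shallowest one until the tree is level.

    `leaf_depths` lists the depth of every leaf of a full binary tree.
    Returns the sorted depths after rebalancing and the number of moves.
    """
    depths = sorted(leaf_depths)
    moves = 0
    while depths[-1] - depths[0] >= 2:
        deepest = depths.pop()       # leaf that is moved
        depths.pop()                 # its sibling, also at the deepest level
        shallowest = depths.pop(0)   # leaf that receives the moved leaf
        # The sibling goes one level up; the shallowest leaf and the moved
        # leaf become children of a new node, one level further down.
        depths += [deepest - 1, shallowest + 1, shallowest + 1]
        depths.sort()
        moves += 1
    return depths, moves
```

For a degenerate five-leaf tree with depths `[1, 2, 3, 4, 4]`, a single move yields `[2, 2, 2, 3, 3]`. Each move corresponds to one combined deletion and insertion in the key tree.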




Fig. 6. Rebalancing the Tree. 



See Figure 0 for an example. The tree needs a rebalancing, so leaf rig is 
deleted from its original position (ngg) and inserted into a new position (ngg). 
The deletion starts a removal operation with leaf ng updating the new keys. At 
the same time, leaf ng starts refreshing the keys on its path (as an insertion 
requires). The new keys are calculated as follows: kg = TZ{kg), K’^g = U{kg), 
AT'g = U{K'jg), k'g = U{kg) , Kgg = U{k'g) aud Key K[g does not 

need to be changed. 

The KDC generates unicast messages for member ng ( [Kgs , K'ls] and member 
ug ([+38]), and multicast message [9 : —89, 18 : *18, , {Agg}^^^]. 

Member ng deletes all its known keys and replaces them by those just re- 
ceived. Member ng updates its keys. Members ny and key kggs holders extract 
their parts and update their keys. Member ng derives Agg. Key Aig’s holders 
refresh A(g. 

5 Evaluation 

In this section, we compare the properties of the EHBT algorithm with the other
algorithms introduced in Section 2: PRGT* (Canetti et al.), HBT+ (Caronni et
al.), OFT (McGrew and Sherman) and ELK (Perrig). We focus our criteria on
KDC computation, joined member computation (for insertions), sibling
computation (sibling to the joining/leaving member), size of the multicast message,
size of the unicast messages and storage at both KDC and members.

* Canetti does not specify the PRG function to use, hence we assume the same RC5
algorithm used in ELK.

The notations used in this section are: 



n    number of members in the group
d    height of the tree (for a balanced tree d = log2 n)
I    size of a key index in bits
K    size of a key in bits
G    key generation
H    hash function execution
X    xor operation
E    encryption operation
D    decryption operation



Table 1 summarizes the computation required from the KDC, joined member
and sibling to joined member, and message size of joining member's unicast
message, sibling's unicast message and multicast message during single join
operations.



Table 1. Single Join Operation Equations. 



Scheme/  Computation                                                          Message size
Resource KDC                          Join member           Sib member        Join unicast   Sib unicast   Multicast
EHBT     G + (d + 1)(X + H + E)       (d + 1)D              (d + 1)(X + H)    (d + 1)K       I             dI
PRGT     2G + dH + (d + 1)E           (d + 1)D              D + dH            (d + 1)K       I + K         dI
HBT+     2G + dH + (d + 1)E           (d + 1)D              D + dH            (d + 1)K       I + K         dI
OFT      G + (d + 1)H + dX + 3dE      (d + 1)D + d(H + X)   2D + d(H + X)     (d + 1)K       I + 2K        (d + 1)K
ELK      G + (4n - 2)E and (d + 3)E   (d + 1)D              2dE and 2E        (d + 1)K       I             0



Table 2 summarises multiple join operation equations. The parameters analysed
are the same parameters used in Table 1. The equations are valid for multiple
joins when the original number of members is doubled after the mass join,
which means that every old member gets a new sibling (a new member) and all
the keys in the tree are affected. This represents the worst case possible for join
operations. For the sake of the equations in this table, n is the original number
of members in the group prior to the mass join, but d is the new height of
the tree after the mass join.

EHBT requires less computation than the other schemes, but it loses out
to ELK when comparing the message sizes. The reason is that ELK
employs a timed rekey, which means that the tree is completely refreshed at
intervals, regardless of membership changes, thus only the index of the new parent
inserted needs to be sent to the sibling of the joining member. However, this raises
two issues: first, at every interval the KDC has to refresh all its 2n - 1 keys, which
implies unnecessary work for the KDC; second, this scheme does not support
rekey on membership changes (regarding join operations). Additionally, ELK
imposes some delay on the joining member before it receives the group key.

Table 2. Multiple Join Operation Equations.



Scheme/  Computation                                                                     Message size
Resource KDC                                     Join member           Sib member        Join unicast   Sib unicast   Multicast
EHBT     nG + (3n - 1)(X + H) + n(d + 1)E        (d + 1)D              (d + 1)(X + H)    n : (d + 1)K   n : I         (n - 1)I
PRGT     2nG + (n - 1)H + n(d + 2)E              (d + 1)D              D + dH            n : (d + 1)K   n : I + K     (n - 1)I
HBT+     2nG + (n - 1)H + n(d + 2)E              (d + 1)D              D + dH            n : (d + 1)K   n : I + K     (n - 1)I
OFT      nG + (4n - 2)(H + X) + (nd + 5n - 1)E   (d + 1)D + d(H + X)   2D + d(H + X)     n : (d + 1)K   n : I + 2K    (2n - 2)K
ELK      (8n - 2)E and nG + n(d + 3)E            (d + 1)D              2dE and 2E        n : (d + 1)K   n : I         0



Table 3. Single Leave Operation Equations. 



Scheme/  Computation                     Multicast
Resource KDC             Sib member
EHBT     d(X + H + E)    d(X + H)        I + dK
PRGT     (2d + 1)E       D + dE          I + (d + 1)K
HBT+     2dE             dD              I + 2dK
OFT      d(H + X + E)    D + d(H + X)    I + (d + 1)K
ELK      8dE             dD + 5dE        I + d(n1 + n2)



Table 3 summarizes the KDC computation, sibling computation and multicast
message size during single leave operations. We also analyse the equations
of multiple leave operations, and we show the results in Table 4. For mass
leaving, we consider the situation when exactly half of the group members leave the
group. The sibling of every leaving member remains in the tree, and hence, all
keys in the tree are affected.



Table 4. Multiple Leave Operation Equations. 



Scheme/  Computation                                              Multicast
Resource KDC                                Sib member
EHBT     (2n - 1)(X + H) + (n - 1)E         D + (d + 1)(X + H)    nI + (n - 1)K
PRGT     (5n/2 - 2)E                        D + dE                (3n/2 - 1)K
HBT+     (2n - 2)E                          dD                    nI + 2(n - 1)K
OFT      (2n - 2)H + (n - 1)X + (3n - 2)E   (d + 1)D + d(H + X)   nI + (3n - 2)K
ELK      (7n - 3)E                          dD + 5dE              nI + (n - 1)(n1 + n2)



For leaving operations, EHBT again achieves better results than the other
schemes regarding the computations involved, but loses out to ELK when
comparing the multicast message size. ELK has a slightly smaller multicast message
than EHBT because it sacrifices security. ELK uses only $n_1 + n_2$ bits of a total
of $K$ possible bits for generating a new key, and this procedure weakens that key.
Consequently, an expelled member needs to try only $2^{n_1 + n_2}$ possibilities to
recover the new key. In EHBT, however, an expelled member needs to perform
the full $2^K$ operations to brute-force the new key.

We have simulated a group with 8192 members. For the calculations of the
multiple join operations, we doubled the size of the group to 16384 members,
and then we removed all joining members and finished with the 8192 original
members. We measured encryption and decryption times for the RC5 algorithm,
the MD5 hash function and the xor operation. We used 16-bit keys for the
calculations. We used Java version 1.3 and the IAIK cryptographic toolkit on an
850 MHz Mobile Pentium III processor. It takes $1.72 \cdot 10^{-2}$ ms for RC5 to
encrypt a 16-bit key with a 16-bit key, and $1.73 \cdot 10^{-2}$ ms to decrypt it.
Hashing a 16-bit key takes $4.95 \cdot 10^{-3}$ ms and xoring it takes
$1.59 \cdot 10^{-3}$ ms. Finally, generating a 16-bit key takes
$7.33 \cdot 10^{-3}$ ms. Applying these numbers to Tables 2 and 4 produces the
results in Table 5, which show that EHBT is in general faster to compute than the
other protocols.
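As a worked check, the KDC figures for the multiple leave case can be recomputed from the equations in Table 4. The sketch below assumes n = 8192 and per-operation times of 1.72e-2 ms (encryption), 4.95e-3 ms (hash) and 1.59e-3 ms (xor); the exponents are an assumption, chosen because they reproduce the published totals. The function name is hypothetical.

```python
# Assumed per-operation times in milliseconds.
E = 1.72e-2   # RC5 encryption
H = 4.95e-3   # hash
X = 1.59e-3   # xor


def kdc_multiple_leave_ms(n: int) -> dict:
    """KDC computation time for a mass leave, using the Table 4 equations."""
    return {
        "EHBT": (2 * n - 1) * (X + H) + (n - 1) * E,
        "PRGT": (5 * n / 2 - 2) * E,
        "HBT+": (2 * n - 2) * E,
    }


times = kdc_multiple_leave_ms(8192)
for scheme, ms in times.items():
    print(f"{scheme}: {ms:.2f} ms")
```

Under these assumptions the three values round to 248.03, 352.22 and 281.77 ms, matching the corresponding KDC entries of Table 5.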



Table 5. Time in Milliseconds for Multiple Joins and Leaves. 



Scheme/  Multiple Join                             Multiple Leave
Resource KDC           Join member   Sib member    KDC         Sib member
EHBT     2334          0.25          0.09          248.03      0.10
PRGT     2415          0.25          0.08          352.22      0.24
HBT+     2415          0.25          0.08          281.77      0.22
OFT      2951          0.35          0.12          516.78      0.32
ELK      1140 + 2455   0.25          0.48 + 0.03   1105.46     1.34



Finally, EHBT and the other schemes require the KDC to store 2n - 1 keys
and members to store d + 1 keys.

6 Security Considerations 

The security of the EHBT protocol relies on the cryptographic properties of the 
h function. One-way hash functions, unfortunately, are not proven secure |2|; 
nevertheless, for the time being, there has not been any successful attack on 
either the full MD5 [3 or SHA ^ algorithms m- 

Taking into account the use of hash functions as function h, attacks on the
hidden key are limited to brute-force attack. Such an attack can take $2^n$ hashes
to find the original key, with n being the number of bits of the original key used
as input.

In order to guarantee backward secrecy and forward secrecy, every time there 
is a membership change, the keys related to joining members or leaving members 
are changed. 

When a member is added to the tree, all keys held by nodes in its ancestor
set are changed to avoid giving the new member access to past information. For
example, see Figure 2: when member $u_2$ is inserted in the tree, key $K_{12}$ is
created and keys $K_{14}$ and $K_{18}$ are refreshed. Node $n_2$ does not have
access to the old values, because it only receives the new key values, which were
hidden by the hash function, and assuming the hash function is secure, $n_2$ has
no other way to recover the old key but brute-forcing it. The same rule applies
when $n_2$ leaves; key $K_{12}$ is deleted from the tree and keys $K_{14}$ and
$K_{18}$ are updated and, since $n_2$ does not have access to their new values, it
no longer has access to the group communication.
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7 Conclusion 

Using one-way hash functions and xor operations, we constructed an efficient
HBT protocol that achieves better overall performance than other HBT
protocols. Our protocol, called EHBT, requires only $(I \cdot \log_2 n)$ message
size for join operations and $(K \cdot \log_2 n)$ message size for leaving
operations. Additionally, EHBT requires the same key storage as other HBT
protocols, and it requires much less computation to rekey the tree after
membership changes.
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A Reasoning on Using Index i in Function (F 

Index i is included in the formula T to avoid giving the possibility for members to 
have access to keys that they are not meant to. For example, removing member 
ri 2 in Figure E] means new keys k[ = U{ki), k[^ = TZ{k[) and fcjg = TZ{k[ 4 ^). 
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If, immediately after member ri2 has left the group, member no joins it and 
is inserted as a sibling of ni, then it means new keys k" = ki^ = T^(fci) 

(new node nio), ^"4 = U{k'i^ and fc"g = fi(fcjg). 

If we remove $i$ from function $\mathcal{F}$ and instead only apply a simple hash
$h$ to update keys, then the keys from the removal above become $k_1' = h(k_1)$,
$k_{14}' = h(k_1')$ (or $h(h(k_1))$) and $k_{18}' = h(k_{14}')$ (or
$h(h(h(k_1)))$), and the keys from the join become $k_1'' = h(k_1')$ (or
$h(h(k_1))$), $k_{10} = h(k_1'')$ (or $h(h(h(k_1)))$), $k_{14}'' = h(k_{14}')$
and $k_{18}'' = h(k_{18}')$. As one can see, keys $k_{10}$ and $k_{18}'$ are
identical, which means that member $n_0$ can have access to past messages
encrypted with $k_{18}'$ (or $k_{10}$).
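The collision described above is easy to reproduce. The sketch below is illustrative only: truncated SHA-256 stands in for $h$, the indexed variant assumes $\mathcal{F}(k, i) = h(k \oplus i)$, and `chain` is a hypothetical helper that applies a derivation once per listed node index.

```python
import hashlib


def h(key: bytes) -> bytes:
    """Stand-in one-way hash (SHA-256 truncated to 16 bytes)."""
    return hashlib.sha256(key).digest()[:16]


def F(key: bytes, index: int) -> bytes:
    """Assumed indexed derivation: hash of (key XOR index)."""
    mixed = int.from_bytes(key, "big") ^ index
    return h(mixed.to_bytes(16, "big"))


def chain(key: bytes, indices, derive) -> bytes:
    """Apply `derive` once per node index, bottom-up."""
    for idx in indices:
        key = derive(key, idx)
    return key


k1 = bytes(range(16))
plain = lambda key, idx: h(key)          # index ignored: hash-only variant

# Leave of n2: k1 -> k1' -> k14' -> k18'.  Join of n0: k1 -> k1' -> k1'' -> k10.
k18_after_leave = chain(k1, [1, 14, 18], plain)
k10_after_join = chain(k1, [1, 1, 10], plain)
print(k18_after_leave == k10_after_join)                          # True
print(chain(k1, [1, 14, 18], F) == chain(k1, [1, 1, 10], F))      # False
```

With the hash-only variant both chains are three applications of $h$ to $k_1$, so the new member's key $k_{10}$ equals the earlier root key. Mixing the node index into each step makes the two chains diverge.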
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Abstract. IP multicast suffers from scalability problems for large num- 
bers of multicast groups, since each router keeps forwarding state proportional
to the number of multicast trees passing through it. In this paper,
we present and evaluate aggregated multicast, an approach to reduce mul-
ticast state. In aggregated multicast, multiple groups are forced to share 
a single delivery tree. At the expense of some bandwidth wastage, this 
approach can reduce multicast state and tree management overhead at 
transit routers. It may also simplify and facilitate the provisioning of QoS 
guarantee for multicast in future aggregated-flow-based QoS networks. 
We formulate the tree sharing problem and propose a simple intuitive 
algorithm. We study this algorithm and evaluate the trade-off of aggrega- 
tion vs. bandwidth overhead using simulations. Simulation results show 
that significant aggregation is achieved while at the same time bandwidth 
overhead can be reasonably controlled. 



1 Introduction 

Multicast is a mechanism to efficiently support multi-point communications. IP 
multicast utilizes a tree delivery structure on which data packets are duplicated 
only at fork nodes and are forwarded only once over each link. Thus IP multi- 
cast is resource-efficient in delivering data to a group of members simultaneously 
and can scale well to support very large multicast groups. However, even after 
approximately twenty years of multicast research and engineering effort, IP mul- 
ticast is still far from being as common-place as the Internet itself. 

The deployment of multicasting has been delayed partly because of the
scalability issues of the related forwarding state. In unicast, address aggregation
coupled with hierarchical address allocation has helped to achieve scalability.
This cannot be easily done for multicasting, since a multicast address
corresponds to a logical group and does not convey any information on the location
of its members. A multicast distribution tree requires all tree nodes to
maintain per-group (or even per-group/source) forwarding state, and the number
of forwarding state entries grows with the number of "passing-by" groups. As
multicast gains widespread use and the number of concurrently active groups
grows, more and more forwarding state entries will be needed. More forwarding
entries translate into more memory requirements, and may also lead to a slower
forwarding process since every packet forwarding involves an address look-up.
This perhaps is the main scalability problem with IP multicast when the number
of simultaneous on-going multicast sessions is very large.

* This material is based upon work supported by the National Science Foundation
under Grant No. 9805436 and No. 9985195, CISCO/CORE fund No. 99-10060, DARPA
award N660001-00-1-8936, TCS Inc., and DIMI matching fund DIMOO-10071.

Recognition of the forwarding-state scalability problem has prompted some 
recent research in forwarding state reduction. Some architectures aim to com- 
pletely eliminate multicast state at routers m using network-transparent mul- 
ticast, which pushes the complexity to the end-points. Some other schemes at- 
tempt to reduce forwarding state by tunneling mi or by forwarding-state ag- 
gregation EEDI- Apparently, less entries are needed at a router if multiple for- 
warding state entries can be aggregated into one. Thaler and Handley analyze 
the aggregatability of forwarding state in m using an input/output filter model 
of multicast forwarding. Radoslavov et al. propose algorithms to aggregate for- 
warding state and study the bandwidth-memory tradeoff with simulations in jS|. 
Both these works attempt to aggregate routing state after the distribution trees 
have been established. 

We propose a novel scheme to reduce multicast state, which we call aggre- 
gated multicast. The difference with previous approaches is that we force multiple 
multicast groups to share one distribution tree, which we call an aggregated tree. 
This way, the number of trees in the network may be significantly reduced. Con- 
sequently, forwarding state is also reduced: core routers only need to keep state 
per aggregated tree instead of per group. The trade-off is that this approach may 
waste extra bandwidth to deliver multicast data to non-group-member nodes. 
Simulation results demonstrate that, the more bandwidth we sacrifice, the more 
state reduction we can achieve. The management policy and functional require- 
ments can determine the right point in this trade-off. In our earlier work we 
introduced the basic concepts of aggregated multicast. In this paper, we propose 
an algorithm to assign multicast groups to delivery trees with controllable band- 
width overhead. We also propose a model to capture the membership patterns 
of multicast users, which can affect our ability to aggregate groups. Finally, we 
study the trade-off between aggregation and bandwidth overhead using a series
of simulations.

The rest of this paper is organized as follows. Section 2 introduces the
concept of aggregated multicast and discusses some related issues. Section 3 then
formulates the tree sharing problem and presents an intuitive solution, and
Section 4 provides a simulation study of our algorithm and cost/benefit evaluation.
Finally, Section 5 discusses the implications and contributions of our work.



2 Aggregated Multicast 



Aggregated multicast is targeted as an intra-domain multicast provisioning
mechanism. The key idea of aggregated multicast is that, instead of constructing a
tree for each individual multicast session in the core network (backbone), one
can force multiple multicast sessions to share a single aggregated tree.

2.1 Concept 




Fig. 1. Domain peering and a cross-domain multicast tree, tree nodes: D1, A1,
Aa, Ab, A2, B1, A3, C1, covering group G0 (D1, B1, C1).

Fig. 1 illustrates a hierarchical inter-domain network peering. Domain A
is a regional or national ISP's backbone network, and domains D, X, and Y are
customer networks of domain A at a certain location (say, Los Angeles). Domains
B and C can be other customer networks (say, in New York) or some other ISP's
networks that peer with A. A multicast session originates at domain D and has
members in domains B and C. Routers D1, A1, A2, A3, B1 and C1 form the
multicast tree at the inter-domain level while A1, A2, A3, Aa and Ab form an
intra-domain sub-tree within domain A (there may be other routers involved in
domains B and C). The sub-tree can be a PIM-SM shared tree rooted at an RP
(Rendezvous Point) router (say, Aa), a bi-directional shared CBT (Center-Based
Tree) centered at Aa, or maybe an MOSPF tree. Here we will not
go into intra-domain multicast routing protocol details, and just assume that
the traffic injected into router A1 by router D1 will be distributed over that
intra-domain tree and reach routers A2 and A3.

Consider a second multicast session that originates at domain D and also has
members in domains B and C. For this session, a sub-tree with exactly the same
set of nodes will be established to carry its traffic within domain A. Now if there
is a third multicast session that originates at domain X and also has members
in domains B and C, then router X1 instead of D1 will be involved, but the
sub-tree within domain A still involves the same set of nodes: A1, A2, A3, Aa,
and Ab. To facilitate our discussions, we make the following definitions. We call
terminal nodes the nodes where traffic enters or leaves a domain, A1, A2, and
A3 in our example. We call transit nodes the tree nodes that are internal to the
domain, such as Aa and Ab in our example. Using the terminology commonly
used in DiffServ [2], terminal nodes are often edge routers and transit nodes are
often core routers in a network.

In conventional IP multicast, all the nodes in the above example that are
involved within domain A must maintain separate state for each of the three
groups individually, though their multicast trees are actually of the same "shape".
Alternatively, in aggregated multicast, we can set up a pre-defined tree (or
establish one on demand) that covers nodes A1, A2 and A3 using a single multicast
group address (within domain A). This tree is called an aggregated tree (AT)
and it is shared by more than one multicast group. We say an aggregated tree
T covers a group G if all terminal nodes for G are member nodes of T. Data
from a specific group is encapsulated at the incoming terminal node. It is then
distributed over the aggregated tree and decapsulated at exiting terminal nodes
to be further distributed to neighboring networks. This way, transit routers Aa
and Ab only need to maintain a single forwarding entry for the aggregated tree
regardless of how many groups are sharing it. Furthermore, the use of aggregated
multicast in one domain is transparent to the rest of the network.

2.2 Discussion 

Aggregation reduces the required multicast state in a straightforward way.
Transit nodes don't need to maintain state for individual groups; instead, they only
maintain forwarding state for a smaller number of aggregated trees. On a
backbone network, core nodes are the busiest and often they are transit nodes for
many "passing-by" multicast sessions. Relieving these core nodes from
per-micro-flow multicast forwarding enables better scalability with the number of
concurrent multicast sessions.

The management overhead for the distribution trees is also reduced. First, 
there are fewer trees that exchange refresh messages. Second, tree maintenance 
can be a much less frequent process than in conventional multicast, since an 
aggregated tree has a longer life span. The control overhead reduction improves 
the scalability of multicast in an indirect yet important way. 

The problem of matching groups to aggregated trees hides several subtleties.
The set of the group members and the tree leaves are not always identical. A
match is a perfect or non-leaky match for a group if all the tree leaves are
terminal nodes for the group, thus traffic will not "leak" to any nodes that are
not group members. For example, the aggregated tree with nodes (A1, A2, A3,
Aa, Ab) in Fig. 1 is a perfect match for our earlier multicast group G0, which has
members (D1, B1, C1). A match may also be a leaky match. For example, if
the above aggregated tree is also used for group G1, which only involves member
nodes (D1, B1), then it is a leaky match since traffic for G1 will be delivered to
node A3 (and will be discarded there since A3 does not have state for that group).
A disadvantage of leaky match is that some bandwidth is wasted to deliver data
to nodes that are not members of the group (e.g., to deliver multicast packets to
node A3 in this example). Leaky match may be unavoidable since usually it is
not possible to establish aggregated trees for all possible group combinations.
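The distinction between perfect and leaky matches can be expressed as two set comparisons. The sketch below is illustrative: the function name is hypothetical, and it assumes that the tree's leaves are exactly A1, A2 and A3 and that groups G0 and G1 attach to domain A at (A1, A2, A3) and (A1, A2), respectively.

```python
def classify_match(tree_nodes: set, tree_leaves: set, group_terminals: set) -> str:
    """Classify how an aggregated tree matches a group.

    tree_nodes      -- all nodes of the aggregated tree
    tree_leaves     -- leaf nodes of the aggregated tree
    group_terminals -- terminal (edge) nodes the group needs to reach
    """
    if not group_terminals <= tree_nodes:
        return "not covered"          # some terminal node is outside the tree
    if tree_leaves <= group_terminals:
        return "perfect"              # no leaf receives unwanted traffic
    return "leaky"                    # at least one leaf is not a group terminal


TREE_NODES = {"A1", "A2", "A3", "Aa", "Ab"}
TREE_LEAVES = {"A1", "A2", "A3"}      # assumed leaves of the example tree

print(classify_match(TREE_NODES, TREE_LEAVES, {"A1", "A2", "A3"}))  # G0
print(classify_match(TREE_NODES, TREE_LEAVES, {"A1", "A2"}))        # G1
```

Under these assumptions G0 is classified as a perfect match and G1 as a leaky one, because leaf A3 is not a terminal node for G1.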

Aggregated multicast can be deployed incrementally and it can interoperate
with traditional multicast. First, we can invoke aggregation on an as-needed basis.
For example, we can aggregate only when the number of groups in a domain
goes over a threshold. We can also choose to aggregate only when the
aggregation causes reasonable bandwidth overhead, as we will discuss in detail in later
sections. Second, aggregated multicast can co-exist with traditional multicast.
Finally, the aggregation happens only within a domain, while it is transparent
to the rest of the network including neighboring domains.

A related motivation for aggregated multicast is how to simplify the pro- 
visioning of multicast with QoS guarantees in future QoS-enabled networks. 
Regarding QoS support, the per-flow traffic management required by Integrated
Services does not scale. Today people are backing away from it and are
moving towards aggregated-flow-based Differentiated Services. The intrinsic
per-flow nature of multicast may be problematic for DiffServ networks especially 
in provisioning multicast with guaranteed service quality. Aggregated multicast 
can simplify and facilitate QoS management for multicast by pre-assignment of 
resource/bandwidth (or reservation on demand) in a smaller number of shared 
aggregated trees. 

It is worth pointing out that our approach of “group aggregation” is funda- 
mentally different from the “forwarding-state aggregation” approaches in [Bmi!. 
We force multiple multicast groups to share a single tree, while their approach 
is to aggregate multiple multicast forwarding entries on each router locally. In 
a nutshell, we first aggregate and then route, while they first route and then 
aggregate. Note that the two approaches can co-exist: it is possible to further 
reduce multicast state using their approaches even after deploying our approach. 



3 The Tree Sharing Problem 

To implement aggregated multicast, two main problems must be worked out 
first: (1) what are the aggregated trees that should be established; (2) which 
aggregated tree is to be used for a certain group. In this section, we will formulate 
the tree sharing problem and propose a simple and intuitive algorithm; in the 
next section, we will present simulation results. 

3.1 Why the Problem? 

If aggregated multicast is only used by an ISP to provide multi-point connections
among several routers that have heavy multicast traffic or that are strategically
placed to carry inter-domain multicast, then a small number of trees can be
pre-established and the matching (from a group to a tree) is straightforward. The
situation becomes complicated if aggregated multicast is used to a greater extent
and the network is large.

Given a network with n edge nodes (nodes that can be terminal nodes for
a multicast group), the number of different group combinations is about $2^n$
($= C_n^2 + C_n^3 + \dots + C_n^n = 2^n - 1 - n$, given that a group has at least
two members), which grows exponentially with n. For a reasonably large network,
it doesn't make sense to establish pre-defined trees for all possible groups: that
number can be larger than the number of possible concurrently active groups. So we
should and can only establish a subset of trees out of all possible combinations.
This is where leaky match comes into play. Meanwhile, if aggregated multicast is
used as a general multicast provisioning mechanism, then it becomes a necessity
to dynamically manage and maintain aggregated trees since a static set of trees
may not be very resource efficient all the time as groups come and go. For any
solution one may have, the question is how much aggregation it can achieve and
how efficient it is regarding bandwidth use.
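The count of possible groups is quick to verify numerically. The helper name below is hypothetical; the function simply sums the binomial coefficients for group sizes from two to n.

```python
from math import comb


def group_combinations(n: int) -> int:
    """Number of distinct groups with at least two of the n edge nodes."""
    return sum(comb(n, size) for size in range(2, n + 1))


# The closed form 2^n - 1 - n excludes the empty set and the n singletons.
for n in (4, 10, 20):
    print(n, group_combinations(n), 2**n - 1 - n)
```

Even a modest domain with 20 edge nodes already admits more than a million distinct groups, which is why only a subset of trees can be pre-established.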

3.2 Aggregation Overhead 

A network is modeled as an undirected graph $G(V, E)$. Each edge $(i, j)$ is
assigned a positive cost $c_{ij} = c_{ji}$, which represents the cost to transport
unit traffic from node $i$ to node $j$ (or from $j$ to $i$). Given a multicast tree
$T$, the total cost to distribute a unit amount of data over that tree is

$$C(T) = \sum_{(i,j) \in T} c_{ij}. \qquad (1)$$

If every link is assumed to have equal cost, tree cost is simply $C(T) = |T| - 1$,
where $|T|$ denotes the number of nodes in $T$.

Given a multicast group g and a tree T, we say tree T covers group g if 
all members of g are in-tree nodes of T (i.e., in the vertex set of T). If a group 
g is covered by a tree T, then any data packets delivered over T will reach all 
members of $g$, assuming a bi-directional tree. In a transport network, members of
$g$ are not necessarily group member routers (i.e., designated routers with hosts
in their subnets as group members), but rather they are edge routers connecting to
other in-tree routers in neighboring domains.

Now consider a network in which routing algorithm $A$ is used to set up multicast
trees. Given a multicast group $g$, let $T_A(g)$ be the multicast tree computed
by the routing algorithm. Alternatively, this group can be covered by an aggregated
tree $T(g)$. Aggregation overhead is then defined as

$$\Delta(T, g) = C(T(g)) - C(T_A(g)). \tag{2}$$

Aggregation overhead directly reflects bandwidth waste if tree $T(g)$ is used to
carry data for group $g$ instead of the conventional tree $T_A(g)$, with encapsulation
overhead not counted; i.e., bandwidth waste can be quantified as $D_g \times \Delta(T, g)$
if the amount of data transmitted is $D_g$. Note that $T_A(g)$ is not necessarily
the minimum cost tree (Steiner tree). Therefore, the aggregated tree $T(g)$ may
happen to be more efficient than $T_A(g)$, and thus it is possible for $\Delta(T, g)$ to be
negative.
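
A minimal sketch of the two quantities just defined, assuming a tree is given as a list of undirected edges and link costs (if any) as a dictionary keyed by `frozenset` node pairs. Both representations and the function names are illustrative choices, not part of the paper.

```python
def tree_cost(edges, cost=None):
    """C(T): sum of link costs over the tree's edges (unit cost per link if cost is None)."""
    if cost is None:
        return len(edges)  # a tree with |T| nodes has |T| - 1 edges
    return sum(cost[frozenset(e)] for e in edges)


def aggregation_overhead(agg_edges, native_edges, cost=None):
    """Delta(T, g) = C(T(g)) - C(T_A(g)); may be negative."""
    return tree_cost(agg_edges, cost) - tree_cost(native_edges, cost)


# Toy example: the aggregated tree is the path a-b-c-d,
# while the native tree for g = {a, c} is the path a-b-c.
aggregated = [("a", "b"), ("b", "c"), ("c", "d")]
native = [("a", "b"), ("b", "c")]
overhead = aggregation_overhead(aggregated, native)
```

With unit link costs, the overhead here is the one extra link c-d that carries data no member of the group needs.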

3.3 Two Versions of the Problem 

Static Pre-Defined Trees. In this version of the problem, we are given: a
network $G(V, E)$, tree cost model $C(T)$, a set of $N$ multicast groups, and a
number $n$ ($N \gg n$). The goal is to find $n$ trees (each of them covers a different
node set) and a matching from a group $g$ to a tree $T(g)$ such that every group
$g$ is covered by a tree $T(g)$, with the objective of minimizing total aggregation
overhead. This is the problem we need to solve to build a set of pre-defined
aggregated trees based on long-term traffic measurement information.

In reality, different groups may require different bandwidth and have different
lifetimes. Eventually they transmit different amounts of data (to all members,
assumed). Aggregation overhead would be $D_g \times \Delta(T, g)$ for group $g$, which transmits
$D_g$ amount of data. However, if $D_g$ is independent of group size and group
membership, then statistically the end effect will be the same if all groups are
treated as if they have the same amount of data to deliver. Then the total aggregation
overhead is simply $\sum_g \Delta(T, g)$. An average percentage overhead
can be defined as

$$\delta_A = \frac{\sum_g \Delta(T, g)}{\sum_g C(T_A(g))} = \frac{\sum_g C(T(g))}{\sum_g C(T_A(g))} - 1. \tag{3}$$
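
To make the average percentage overhead concrete, here is a small sketch that takes, for each group, the cost of the aggregated tree it uses and the cost of its native tree. The function name and the numbers are illustrative only.

```python
def average_overhead(agg_costs, native_costs):
    """Average percentage overhead: sum_g C(T(g)) / sum_g C(T_A(g)) - 1."""
    return sum(agg_costs) / sum(native_costs) - 1


# Three groups share one aggregated tree of cost 5;
# their native trees would cost 4, 5 and 3 respectively.
delta_a = average_overhead([5, 5, 5], [4, 5, 3])
```

In this toy case the shared tree costs 15 units in total against 12 for the native trees, i.e., a 25% overhead.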



Dynamic Trees. The dynamic version of the problem is more meaningful for 
practical purposes. In this case, instead of a static set of groups, groups dynam- 
ically come and go. Our goal is to find a procedure to generate and maintain 
(establish, modify and tear down) a set of trees and map a group to a tree when 
the group starts, while minimizing the percentage aggregation overhead. 

If an upper bound is put on the number of trees that are allowed simultaneously,
the procedure for the dynamic tree matching problem can clearly be
used to solve the static tree matching problem: the given set of (static) groups
is brought up one by one (without going down) and the dynamic tree matching
procedure is used to generate trees and do the mapping; the resulting set of
trees and the corresponding mapping are the solution (for the static tree sharing
problem).

In the static case, the number of groups given is finite and assumed to be $N$.
In the dynamic case, similarly we can specify $N$ to be the average or maximum
number of concurrently active groups. In both the static and dynamic problems,
$N$ and $n$ (number of trees allowed) affect aggregation overhead. Intuitively, the
closer $n$ is to $N$, the smaller the overhead will be. When $n = N$, the overhead
can be 0 since each group can be matched to the tree computed by the routing
algorithm. The question is whether we can achieve meaningful aggregation ($N \gg n$)
while keeping bandwidth overhead reasonable.



3.4 Dynamic Tree Sharing with Aggregation Overhead Threshold Control

Here we present a solution for the dynamic tree sharing problem with the per- 
centage aggregation overhead statistically controlled under a given threshold. 

First we introduce some notation and definitions. Let MTS (multicast tree
set) denote the current set of multicast trees established in the network. Let
$G(T)$ be the current set of active groups covered by tree $T \in MTS$. Both MTS
and $G(T)$ evolve with time, and the time parameter is implied but not explicitly
indicated. For each $T \in MTS$, an average aggregation overhead for $T$ is kept as

$$\delta(T) = \frac{\sum_{g \in G(T)} \Delta(T, g)}{\sum_{g \in G(T)} C(T_A(g))} = \frac{|G(T)| \times C(T)}{\sum_{g \in G(T)} C(T_A(g))} - 1, \tag{4}$$

and is updated every time $G(T)$ changes. $|G(T)|$ denotes the cardinality of set $G(T)$ (i.e.,
the number of groups covered by $T$). At a certain time $t$, the average aggregation
overhead for all groups, $\delta_A(t)$ ($\delta_A$ at time $t$), can be computed from $\delta(T, t)$ ($\delta(T)$ at
time $t$). If tree $T$ is used to cover group $g$, the percentage aggregation overhead
for this match is



$$\delta(T, g) = \frac{C(T) - C(T_A(g))}{C(T_A(g))}. \tag{5}$$



Let $\delta'(T, g)$ denote the average aggregation overhead if group $g$ is covered by $T$
and is to be added to $G(T)$; then

$$\delta'(T, g) = \frac{(|G(T)| + 1) \times C(T)}{\sum_{g_x \in G(T) \cup \{g\}} C(T_A(g_x))} - 1 = \frac{(|G(T)| + 1) \times C(T)}{\frac{|G(T)| \times C(T)}{1 + \delta(T)} + C(T_A(g))} - 1. \tag{6}$$
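
The second equality in (6) follows from (4). The short derivation below is spelled out for convenience and uses only the symbols already defined.

```latex
\begin{align*}
1 + \delta(T) &= \frac{|G(T)| \times C(T)}{\sum_{g_x \in G(T)} C(T_A(g_x))}
  && \text{(from (4))} \\
\Longrightarrow \quad \sum_{g_x \in G(T)} C(T_A(g_x)) &= \frac{|G(T)| \times C(T)}{1 + \delta(T)}, \\
\sum_{g_x \in G(T) \cup \{g\}} C(T_A(g_x)) &= \frac{|G(T)| \times C(T)}{1 + \delta(T)} + C(T_A(g)).
\end{align*}
```

In other words, the overhead after adding $g$ can be computed from the stored $\delta(T)$, $|G(T)|$ and $C(T)$ alone, without revisiting every group already mapped to $T$.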



When a new multicast session starts with group member set $g$, there are
three options to accommodate this new group: (1) an existing tree covers $g$ and
is used to distribute packets for this new session; (2) an existing tree is extended
to cover this group; (3) a new tree is established for this group.

Let $b_t$ be the given bandwidth overhead threshold. The goal is to control
(statistically) the total percentage overhead to be $\delta(T) < b_t$ for each $T \in MTS$,
or $\delta_A < b_t$ (which is a weaker requirement). The following procedure determines
which of the above options is used:

(1) compute the “native” multicast tree $T_A(g)$ for $g$ (e.g., using a shortest-path
tree algorithm as in MOSPF);

(2) for each tree $T$ in MTS, if $T$ covers $g$, compute $\delta'(T, g)$; otherwise compute
an extended tree $T^e$ to cover $g$ and then compute $\delta'(T^e, g)$; if $\delta'(T, g) < b_t$
or $\delta'(T^e, g) < b_t$, then $T$ or $T^e$ is considered to be a candidate (to cover $g$);

(3) among all candidates, choose the one such that $C(T)$ or
$C(T^e) + |G(T)| \times (C(T^e) - C(T))$ is minimum, and denote it as $T_m$; $T_m$ is used to cover $g$; update MTS
(if $T_m$ is an extended tree), $G(T_m)$, and $\delta(T_m)$;

(4) if no candidate is found in step (2), $T_A(g)$ is used to cover $g$ and is added
to MTS, and correspondingly $G(T_A(g))$ and $\delta(T_A(g))$ are recorded.

To extend tree $T$ to cover group $g$ (step (2)), a greedy strategy similar to
Prim’s minimum spanning tree algorithm can be employed to connect $T$ to nodes
in $g$ that are not covered, one by one.
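
The four-step procedure can be sketched as follows. This is an illustrative simplification under stated assumptions, not the authors' implementation: link costs are all equal (so a tree's cost is its node count minus one), a tree is represented only by its node set, and the routing algorithm and the tree-extension strategy are passed in as the hypothetical callables `native_tree` and `extend_tree`. The helpers `path_tree` and `path_extend` model a toy line topology with integer node labels, purely so the sketch can be run.

```python
from dataclasses import dataclass, field


@dataclass
class AggTree:
    nodes: frozenset                                   # in-tree nodes of T
    native_costs: list = field(default_factory=list)   # C(T_A(g)) for each g in G(T)

    @property
    def cost(self):
        return len(self.nodes) - 1                     # C(T) = |T| - 1 (equal link costs)


def delta_after_adding(nodes, native_costs, new_native_cost):
    """Average overhead of the tree if one more group were mapped to it."""
    total_native = sum(native_costs) + new_native_cost
    return (len(native_costs) + 1) * (len(nodes) - 1) / total_native - 1


def match_group(mts, group, threshold, native_tree, extend_tree):
    """Map `group` to a tree in the list `mts`; returns (tree, action)."""
    native = native_tree(group)                        # step (1)
    native_cost = len(native) - 1
    candidates = []
    for tree in mts:                                   # step (2)
        covers = group <= tree.nodes
        nodes = tree.nodes if covers else extend_tree(tree.nodes, group)
        if delta_after_adding(nodes, tree.native_costs, native_cost) < threshold:
            new_cost = len(nodes) - 1
            # step (3): C(T) if T covers g, else C(T^e) + |G(T)| * (C(T^e) - C(T))
            key = new_cost + len(tree.native_costs) * (new_cost - tree.cost)
            candidates.append((key, covers, tree, nodes))
    if candidates:
        _, covers, tree, nodes = min(candidates, key=lambda c: c[0])
        tree.nodes = nodes
        tree.native_costs.append(native_cost)
        return tree, ("hit" if covers else "extend")
    tree = AggTree(frozenset(native), [native_cost])   # step (4)
    mts.append(tree)
    return tree, "new"


def path_tree(group):
    """Toy topology: nodes on a line, so the tree covering a set is an interval."""
    return frozenset(range(min(group), max(group) + 1))


def path_extend(nodes, group):
    return path_tree(nodes | group)
```

Storing the native tree cost of each mapped group keeps the sketch short; a real implementation could instead keep only $\delta(T)$ and $|G(T)|$ per tree.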

Since each group has a limited lifetime, it will not be using a tree after
that. A simple clean-up procedure can be applied when a group goes off: when
a multicast session $g$ goes off, $g$ is removed from $G(T)$, where $T$ is the tree used
to cover $g$; if $G(T)$ becomes empty, then $T$ is removed from MTS; $T$ is pruned
recursively for nodes no longer needed in the tree; $G(T)$ and $\delta(T)$ are updated.
A node is no longer needed in tree $T$ if it is a leaf and is not a member of any
group $g \in G(T)$.
In the above algorithm description, we have assumed tree $T$ is a bi-directional
tree so that it can be used to cover any group whose members are all in-tree
nodes of $T$. We can instead enforce that each tree is source-specific and that each
group specifies a source node; the above algorithm still applies, except
that we may end up with more trees.



Bandwidth-Aware Aggregation. In all the aggregation overhead definitions
given above, the bandwidth requirement of a multicast session is not considered.
This is in agreement with today’s IP multicast routing architecture, where a
group’s bandwidth requirement is unknown to the routing protocols. At the
same time, we assumed that both the bandwidth requirement and the lifetime of a group
are independent of the group size and member distribution. If the bandwidth
requirement is given for each multicast session (e.g., in future networks with QoS
guarantees), the above algorithms can be extended to consider the bandwidth in a
straightforward way. Due to space limitations, we do not present this formulation
here.



3.5 Performance Metrics 



We use the following metrics to quantify the effectiveness of an aggregation 
method. 

Let $N(t)$ be the number of active multicast groups in the network at time $t$
and $M(t)$ the number of trees. Aggregation degree is defined as

$$AD(t) = \frac{N(t)}{M(t)}. \tag{7}$$

AD is an important indication of tree management overhead reduction. For
example, the number of trees that need periodic refresh messages to keep state
is reduced from $N$ to $N/AD$.



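Average aggregation overhead is

$$\delta_A(t) = \frac{\sum_g C(T(g))}{\sum_g C(T_A(g))} - 1 = \frac{\sum_{T \in MTS} |G(T)| \times C(T)}{\sum_{T \in MTS} \frac{|G(T)| \times C(T)}{1 + \delta(T)}} - 1, \tag{8}$$

as defined in the previous subsections. $\delta_A$ reflects the extra bandwidth wasted to carry
multicast traffic using shared aggregated trees.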

Without loss of generality, we assume that a router needs one routing entry
per multicast address in its forwarding table. Here we care about the total
number of state entries that are installed at all routers involved to support a
multicast group in a network. In conventional multicast, the total number of entries
for a group equals the number of nodes $|T|$ in its multicast tree $T$ (or subtree
within a domain, to be more specific); i.e., each tree node needs one entry for
this group. In aggregated multicast, there are two types of state entries: entries
for the shared aggregated trees and group-specific entries at terminal nodes. The
number of entries installed for an aggregated tree $T$ equals the number of tree
nodes $|T|$, and these state entries are considered to be shared by all groups
using $T$. The number of group-specific entries for a group equals the number of
its terminal nodes, because only these nodes need group-specific state.

Furthermore, we also introduce the concepts of irreducible state and reducible
state: group-specific state at terminal nodes is irreducible. All terminal
nodes need such state information to determine how to forward multicast
packets received, whether in conventional multicast or in aggregated multicast.
For example, in our earlier example illustrated by Fig. 1, node A1 always needs
to maintain state for group Gq so it knows it should forward packets for that
group received from D1 to the interface connecting to Aa and forward packets
for that group received from Aa to the interface connecting to node D1 (and not
X1 or Y1), assuming a bi-directional inter-domain tree.

Given a set of groups $G$, if each group $g$ is serviced by a tree $T_A(g)$, then the
total number of state entries is

$$N_a = \sum_{g \in G} |T_A(g)|. \tag{9}$$

Alternatively, if the same set of groups is serviced using a set of aggregated
trees MTS, the total number of state entries is

$$N_t = \sum_{T \in MTS} |T| + \sum_{g \in G} |g|, \tag{10}$$

where $|T|$ is the number of nodes in $T$, and $|g|$ is the group size of $g$. The first
part of (10) represents the number of entries to maintain for the aggregated
trees, while the second part denotes the number of entries that source and exit
nodes of a group need to maintain in order to determine how to forward and
handle multicast data packets. Overall state reduction ratio can be defined
as

$$1 - \frac{N_t}{N_a} = 1 - \frac{\sum_{T \in MTS} |T| + \sum_{g \in G} |g|}{\sum_{g \in G} |T_A(g)|}. \tag{11}$$

A better reflection of state reduction achieved by our “group aggregation” approach,
however, is the reducible state reduction ratio, which is defined as

$$1 - \frac{\sum_{T \in MTS} |T|}{\sum_{g \in G} \left( |T_A(g)| - |g| \right)}, \tag{12}$$

i.e., the total number of entries needed to be maintained by transit nodes has
been reduced from $\sum_{g \in G} (|T_A(g)| - |g|)$ to $\sum_{T \in MTS} |T|$.
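
A small numeric sketch of the state-entry counts may help. It assumes the two reduction ratios are taken as one minus the fraction of entries that remain, and it works directly on tree and group sizes; the function name and the example figures are illustrative.

```python
def state_counts(native_tree_sizes, group_sizes, agg_tree_sizes):
    """Return (entries with native trees, entries with aggregated trees,
    overall reduction ratio, reducible reduction ratio)."""
    n_native = sum(native_tree_sizes)                 # one entry per native-tree node
    n_agg = sum(agg_tree_sizes) + sum(group_sizes)    # shared-tree entries + terminal entries
    overall = 1 - n_agg / n_native
    reducible = 1 - sum(agg_tree_sizes) / (n_native - sum(group_sizes))
    return n_native, n_agg, overall, reducible


# Four groups whose native trees have 6, 5, 7 and 6 nodes and whose sizes are
# 3, 2, 3 and 2, all served by a single aggregated tree with 7 nodes.
n_native, n_agg, overall, reducible = state_counts([6, 5, 7, 6], [3, 2, 3, 2], [7])
```

Here 24 entries shrink to 17 overall, while the transit-node (reducible) entries drop from 14 to 7, which shows why the reducible ratio is the more telling figure.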
Another metric called hit ratio is defined as

$$HR(t) = \frac{\text{number of groups covered by existing trees}}{\text{total number of groups}} = \frac{N_h(t)}{N_t(t)}. \tag{13}$$

Both $N_h$ and $N_t$ are accumulated from time 0 (start of simulation) to time $t$.
The higher $HR(t)$, the less often new trees have to be set up to accommodate
new groups. Similarly, extend ratio is defined as

$$ER(t) = \frac{\text{number of groups covered by extended trees}}{\text{total number of groups}} = \frac{N_e(t)}{N_t(t)}. \tag{14}$$

The “cost” to extend an existing tree is expected to be lower than that of setting up a new
tree. The percentage of groups that require new trees to be established is $1 - HR - ER$,
up to time $t$.
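
The hit and extend ratios are simple running counts. The sketch below assumes each arriving group has been labelled with one of the strings "hit", "extend" or "new" (labels chosen here for illustration) according to which option was used to accommodate it.

```python
from collections import Counter


def hit_extend_ratios(actions):
    """Return (HR, ER, fraction of groups needing a new tree) for a list of labels."""
    counts = Counter(actions)
    total = len(actions)
    hr = counts["hit"] / total
    er = counts["extend"] / total
    return hr, er, 1 - hr - er


hr, er, new_fraction = hit_extend_ratios(["new", "hit", "hit", "extend"])
```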



4 Simulation Studies 

In this section, we evaluate our approach by studying the trade-off between
aggregation and bandwidth overhead using simulations. We find that we can
achieve significant aggregation for reasonable bandwidth overhead. We attempt
to test our approach in a wide variety of scenarios. Given the absence of
large-scale real multicast traces, we are forced to develop membership models that
exhibit locality and correlated group preferences.

4.1 Multicast Group Models 

The performance of any aggregation is substantially affected by the distribution
of multicast group members in the network. Currently multicast is not widely
deployed and its usage has been limited; thus, trace data from real multicast
sessions is limited and can only be considered as an indication of multicast
patterns in large-scale multicast. We develop and use the following different
models to generate multicast groups.

In most multicast routing research literature, members of a multicast group
are randomly chosen among all nodes. In this model, not only are group members
assumed to be uncorrelated, but all nodes are treated the same as well. This
reflects the member distribution of many applications well, such as Internet gaming,
but not all of them. In some applications, members tend to cluster together; for
example, an Internet broadcast of a Lakers basketball game is watched by its local
fans around Los Angeles area. Inter-group correlation is also an important factor; 
for example, members of a multicast group might tend to be in another group as 
well H21. Neither does this model reflect the fact that not all nodes in the network 
are equivalent. For example, consider two nodes in MCI’s backbone network: one
is in Los Angeles and the other one is in Santa Barbara. It is very likely that
the LA node has many more multicast sessions going through it than the
Santa Barbara node, given that MCI has a much larger customer base in LA.
Besides, there can be historic correlation of group distribution among network
nodes as well. For example, a customer company of MCI’s Internet service has
three locations in LA, Seattle and Houston, which are connected through MCI’s
backbone. There are often video conferences among these three sites, and when
there is one, MCI’s routers at the three places will be in a multicast session.
From the above discussion, to model multicast group distribution, the following
factors have to be considered: (1) member distribution within a group
(e.g., spread or clustered); (2) inter-group correlation; (3) node difference in multicast
participation; (4) inter-node correlation; (5) group size distribution, i.e.,
how often we tend to have very small groups or very large groups. Factor (5)
has not been discussed above, but clearly it is very important as well. Several
models are described in [7], where factors (1) and (2) are considered. The terms
affinity and disaffinity are used in [7] to describe the clustering and spreading
out tendencies of members within a group.

In our work, we use the node weighted framework, which incorporates the
difference among network nodes (factor (3)). In this framework, each node is
assigned a weight representing the probability for that node to be in a group.
We generate groups from the node weight assignment in two different ways, which
gives rise to two different models.



The Random Node-Weighted Model. This model statistically controls the
number of groups a node will participate in based on its weight: for two nodes $i$
and $j$ with weights $w(i)$ and $w(j)$ ($0 \le w(i), w(j) \le 1$), let $N(i)$ be the number of
groups that have $i$ as a member and $N(j)$ the number of groups that have $j$ as
a member; then $N(i)/N(j) = w(i)/w(j)$ on average. Assume the number of nodes in the
network is $N$ and nodes are numbered from 1 to $N$. Each node $i$, $1 \le i \le N$,
is assigned a weight $w(i)$, $0 \le w(i) \le 1$. Then a group can be generated by the
following procedure:

for i = 1 to N do
    generate a random number between 0 and 1, let it be p
    if p < w(i) then
        add i as a group member
    end if
end for
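
A direct Python rendering of this procedure is given below as a sketch. It assumes weights are supplied as a dictionary from node identifier to weight; the function name and the optional `rng` parameter (any object with a `random()` method) are illustrative choices.

```python
import random


def random_node_weighted_group(weights, rng=random):
    """Node i joins the group independently with probability weights[i]."""
    return {i for i, w in weights.items() if rng.random() < w}


# Example: node 1 never joins, node 2 always joins, node 3 joins about half the time.
example_group = random_node_weighted_group({1: 0.0, 2: 1.0, 3: 0.5}, random.Random(1))
```

Note that nothing in this procedure controls the resulting group size, so a caller would have to discard groups with fewer than two members.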



The Group-Size Controlled Model. In this model, we want to have more
accurate control over the size of the groups we generate. For this reason, we use
the following procedure to generate groups whose size follows
a given probability mass function (pmf) $p_X(x)$:

generate group size n according to p_X(x)
while the number of members is less than n do
    randomly pick a non-member, let it be i
    generate a random number between 0 and 1, let it be p
    if p < w(i) then
        add i as a group member
    end if
end while

This model controls the group-size distribution; however, nodes no longer participate
in groups according to their weights (i.e., we no longer have
$N(i)/N(j) = w(i)/w(j)$ on average).
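
The group-size controlled procedure can be sketched in the same style. The pmf is assumed to be given as a dictionary from size to probability, and the guard against requesting more members than there are nodes with positive weight is our addition to keep the loop from running forever; neither detail comes from the paper.

```python
import random


def group_size_controlled_group(weights, size_pmf, rng=random):
    """Draw a size n from size_pmf, then accept random non-members with probability w(i)."""
    sizes, probs = zip(*size_pmf.items())
    n = rng.choices(sizes, weights=probs, k=1)[0]
    if n > sum(1 for w in weights.values() if w > 0):
        raise ValueError("not enough nodes with positive weight")
    members = set()
    while len(members) < n:
        i = rng.choice([x for x in weights if x not in members])  # pick a non-member
        if rng.random() < weights[i]:
            members.add(i)
    return members
```

Because rejected nodes are simply retried until the target size is reached, low-weight nodes end up over-represented relative to their weights, which is the loss of proportionality noted above.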



4.2 Simulation Results 

We present results from simulation using a network topology abstracted from a 
real network topology, AT&T IP backbone Q, which has a total of 123 nodes: 
9 gateway routers, 9 backbone routers, 9 remote GSR (gigabit switch router)
access routers, and 96 remote access routers.

The abstract topology is constructed as follows. First, we “contract” all the 
attached remote access routers of a gateway router or a backbone router into 
one node (connecting to the original gateway/backbone router), which is called 
a contracted node. Since a gateway router in the backbone represents con- 
nectivity to other peering network(s) and/or Internet public exchange point(s), 
a neighbor node called exchange node is added to each gateway router to 
represent such external connectivity. The result is a simplified network with 54 
nodes. Among these nodes, gateway nodes (9 of them) and backbone nodes (9 
of them) are assumed to be core routers only (i.e., will not be terminal nodes for 
any multicast group) and are assigned weight 0. Each access router is assigned 
weight 0.01, and a “contracted” node’s weight is the summation of the weights 
of all access routers from which it is contracted. Exchange nodes are assigned 
weight ranging from 0.1 to 0.9 in different simulation runs. 

In simulation experiments, multicast connection requests arrive as a Poisson
process with arrival rate $\lambda$. Each time a connection comes up, all group members
are specified. Group membership is generated using the node weighted framework
discussed in Section 4.1. Connections’ lifetime has a Poisson distribution with
average $\mu$. At steady state, the average number of connections is $N = \lambda \times \mu$. The
algorithm specified in the last section is used to establish/maintain trees and map a
group to a tree. The routing algorithm $A$ is a shortest-path tree algorithm with
root randomly chosen from all group members. Performance data is collected at
certain time points (e.g., at $T = 10\mu$), when steady state is reached, as a “snapshot”.
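
The steady-state relation between arrival rate, mean lifetime and the number of concurrent connections can be checked with a few lines of simulation. This sketch is only illustrative: it draws lifetimes from an exponential distribution with the given mean, which is an assumption made here for simplicity, and the function and parameter names are ours.

```python
import random


def active_at_snapshot(lam, mean_lifetime, snapshot, rng):
    """Count connections still alive at time `snapshot` (Poisson arrivals at rate lam)."""
    t, active = 0.0, 0
    while True:
        t += rng.expovariate(lam)                        # exponential inter-arrival time
        if t > snapshot:
            return active
        lifetime = rng.expovariate(1.0 / mean_lifetime)  # ASSUMPTION: exponential lifetime
        if t + lifetime > snapshot:
            active += 1


# Arrival rate 100 and mean lifetime 10, sampled at ten mean lifetimes after start.
count = active_at_snapshot(100.0, 10.0, 100.0, random.Random(42))
```

With these parameters the snapshot count should land near the product of arrival rate and mean lifetime (1000), with random fluctuation from run to run.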




Fig. 2. Aggregation vs. bandwidth overhead threshold, groups are generated 
using group-size-controlled model. 
In our first experiment, an exchange node is assigned a weight of 0.25 or 0.9
according to the link bandwidths of the original gateway; the rationale is that the
more bandwidth on the outgoing (and incoming) links of a node, the larger
the number of multicast groups it may participate in. Groups are generated using
the group-size-controlled model. Group size is uniformly distributed from 2 to
36 and the average number of concurrently active groups is $\lambda\mu = 1000$. Fig. 2
shows the results of aggregation degree (a), state reduction (b) and hit/extend
ratios (c) vs. bandwidth overhead threshold. We can see that aggregation degree
increases as the bandwidth threshold is increased: if we are willing to sacrifice more
bandwidth, we can accommodate more multicast groups in a shared aggregated
tree. This agrees with our intuition. Fig. 2(b) shows that the overall state
reduction ratio and the reducible state reduction ratio also increase with bandwidth overhead
threshold: as we squeeze more groups into an aggregated tree, we need fewer
trees and achieve more state reduction. Fig. 2(c) tells us that hit ratio goes up
and extend ratio goes down with increasing threshold. This is consistent with
the trend of other metrics (aggregation degree and state reductions). When more
groups can share an aggregated tree, it is more likely for an incoming group to
be covered by an existing tree, and thus it becomes less often necessary to set up new trees
or extend existing trees.







Fig. 3. Aggregation vs. maximum size of multicast groups, groups are generated 
using group-size-controlled model. 



In our second experiment, we keep the bandwidth overhead threshold at
a fixed value (=0.3 for results presented here) and vary the upper bound of
group size (still uniformly distributed with lower bound 2) while keeping all other
parameters the same as in the first experiment. The results in Fig. 3 demonstrate
the effect of group size on aggregation: if there are more large groups, then we
can aggregate more groups into shared trees. As groups become larger, so do
their multicast trees. A larger tree can “cover” more groups than a smaller one
under the same overhead threshold (i.e., there are more subtrees of a larger tree
within that threshold).

We also want to see how node weight affects aggregation. Here we again keep the
bandwidth overhead threshold at a fixed value (=0.3 for results presented here),
and group size is uniformly distributed from 2 to 36. All other parameters are
the same as in our first experiment, while we vary the weight of exchange nodes
from 0.1 to 0.9. The results are shown in Fig. 4. The higher the weight of a node,
the larger the number of groups it may participate in. As we increase the weights
of those exchange nodes, multicast groups are more likely to “concentrate” on
these nodes, and better aggregation is achieved, as the results show.

Fig. 4. Aggregation vs. weight of exchange nodes, groups are generated using
group-size-controlled model.

The same set of experiments is also conducted using the random node-weighted
model. In Fig. 5 we plot the results of an experiment similar to our
first one. The results demonstrate similar trends, although the actual values
differ. Note that the state reduction seems to be comparable.




Fig. 5. Aggregation vs. bandwidth overhead threshold, groups are generated 
using random model. 



We also examine how the aggregation scales with the (statistical average) number
of concurrent groups. We run simulations with different $\lambda\mu$ products while
keeping all other parameters fixed. Fig. 6 plots the results for two different bandwidth
overhead thresholds. It is no surprise that as more groups are pumped
into the network, the aggregation degree increases: on average, more groups
can share an aggregated tree. The scaling trend is encouraging: as the average
number of (concurrent) groups is increased from 1000 to 9000, the number
of aggregated trees is increased from 29 to 46 only (with bandwidth overhead
threshold 0.3). At the same time, the reducible state reduction ratio is getting close
to 1.

Fig. 6. Aggregation vs. concurrent group number, groups are generated using
random model.

To summarize, we observe that the state reduction can be significant: up to
50% for the overall state. Furthermore, we can get significant reduction even for
small bandwidth overhead. Finally, our approach has the right trend: the state
reduction increases as the number and the size of groups increase. This way, the
aggregation becomes more effective when it is really needed.

We must also note the limitations of such simulation studies. As we have
found, multicast group models (size distribution and member distribution as
controlled by node weights) can significantly affect aggregation; thus how well
aggregated multicast works in real networks depends a lot on such
factors in practice. Therefore, it is important to develop realistic multicast
scenarios to evaluate any aggregation approach.

5 Conclusions and Future Work 

We propose a novel approach to address the problem of multicast state scalability.
The key idea of aggregated multicast is to force groups into sharing a
single delivery tree. This comes in contrast to other forwarding-state aggregation
approaches that first create multiple trees and then try to aggregate the
state locally on each router. A key concept in our approach is that it sacrifices
bandwidth to reduce the routing state.

Our work could be summarized in the following points: 

– We introduce the concept of aggregated multicast and discuss several related
issues.

– We formulate the tree sharing problem and present a simple and effective
algorithm to establish aggregated trees and dynamically match groups with
existing trees.

– We propose performance metrics that can be used to evaluate our approach.

– We show that our approach seems to be very promising in a series of simulation
experiments. We can achieve significant state aggregation (up to 50%)
with relatively small bandwidth overhead (10% to 30%).

Our work suggests that the benefits of aggregated multicast lie in the following
two areas: (1) control overhead reduction by reducing the number of trees
that need to be maintained in the network; (2) state reduction at core nodes. The
price to pay for that is bandwidth waste. Our simulation results confirm our
claim and demonstrate the following trends: (1) as we are willing to sacrifice
more bandwidth (by increasing the bandwidth overhead threshold), more or better
aggregation is achieved; (2) better aggregation is achievable as the number
and size of concurrent groups increase. The last is especially important since one
basic goal of aggregated multicast is to achieve better scalability regarding the
number of concurrent groups.



Future Work. Our scheme simplifies multicast management and could lend 
itself to a mechanism of QoS provisioning and multicast traffic engineering with 
the appropriate use of DiffServ or MPLS. We find that this by itself could be a 
sufficient motivation for studying aggregated multicast. 
Abstract. There has been considerable activity recently to develop monitoring
and debugging tools for a multicast session (tree). With these
tools in mind, we focus on the problem of how to lay out multicast sessions
so as to cover a set of links of interest within a network. We define
two variations of this layout (cover) problem that differ in what it means
for a link to be covered. We then focus on the minimum cost problem,
to determine the minimum cost set of trees that cover the links in ques- 
tion. We show that, with few exceptions, the minimum cost problems 
are NP-hard and that even finding an approximation within a certain 
factor is NP-hard. One exception is when the underlying network topol- 
ogy is a tree. For this case, we demonstrate an efficient algorithm that 
finds the optimal solution. We also present several computationally ef- 
ficient heuristics and their evaluation through simulation. We find that 
two heuristics, a greedy heuristic that combines sets of trees with three 
or fewer receivers, and a heuristic based on our tree algorithm, both per- 
form reasonably well. The remainder of the paper applies our techniques 
to the vBNS network and randomly generated networks, examining the 
effectiveness of the different heuristics. 



1 Introduction 

Multicast is a technology that shows great promise for providing the efficient 
delivery of content from a single source to many receivers. An interoperable net- 
working infrastructure is nearly in place (PIM-SM/MSDP/MBGP,SSM) and the 
development of mechanisms for congestion control and reliable data delivery are 
well under way [4lf)j . However, deployment of multicast applications lags behind, 
in large part because of a lack of debugging and monitoring tools. Recently, sev- 
eral promising approaches and protocols have been proposed for the purpose of 
aiding the network manager or the multicast application designer in this task. 
These include the use of end-to-end measurements for inferring internal behavior 
on a multicast tree [2], the development of the multicast route monitor (MRM)
protocol PI, and a number of promising fault monitoring tools, |i;illbj . All of 
these address the problem of identifying performance and/or fault behavior on 
a single multicast tree. 

* This work is sponsored in part by the DARPA and Air Force Research Laboratory
under agreement F30602-98-2-0238.
Although considerable progress has been made in developing tools for a single
tree, little attention has been paid to how to apply these tools to monitor an
entire network, or even a subset of the network. We address this problem; namely,
given a set of links whose behavior is of interest, how does one choose a set of
minimum cost multicast trees within the network on which to apply these tools
so as to determine the behavior of the links in question? The choice of trees, of
course, is determined by the multicast routing algorithm. This raises a related
question, namely, does the multicast routing algorithm even allow a set of trees
that will allow one to determine the behavior of the links of interest? We refer
to this latter problem as the Multicast Tree Identifiability Problem (MTIP) and
the first problem as the Minimum cost Multicast Tree Cover Problem (MMTCP).

We refer to the behavior measurement of a link as a link measure. Note that 
solutions to MTIP and MMTCP depend on details of the mechanism used to 
determine the link measures. Consequently, we introduce two versions of these 
problems, the weak and the strong cover problems. Briefly, the weak cover prob- 
lem is based on the assumption that it is sufficient that each link of interest 
appear in at least one tree. The strong cover problem requires that each link 
occur between two branching points in at least one tree. 
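
The weak cover condition is simple enough to state as code. The sketch below assumes each tree is given as a set of directed links `(u, v)`; the function name is ours, and the strong cover condition is not sketched because it depends on how branching points are defined for a given tree.

```python
def weakly_covers(trees, links_of_interest):
    """True if every link of interest appears in at least one of the trees."""
    covered = set().union(*trees)
    return set(links_of_interest) <= covered


# Two trees rooted at s that share the link (s, a).
example_trees = [{("s", "a"), ("a", "b")}, {("s", "a"), ("a", "c")}]
```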

Briefly, the paper makes the following contributions. 

— We establish that the cover problems are NP-hard and that in some cases, 
finding an approximation within a certain factor of optimal is also NP-hard. 
Thus, we also propose several heuristics and show through simulation that 
a greedy heuristic that iteratively combines trees containing a small number 
of receivers performs reasonably well. 

— We provide polynomial time algorithms that find optimal solutions for a
restricted class of network topologies, including trees. These algorithms can be
used to provide a heuristic for sparse, tree-like networks. This heuristic is
also shown through simulation to perform well.

— We apply our techniques to the vBNS network and randomly generated 
networks, examining the effectiveness of the different heuristics. 

The remainder of the paper proceeds as follows. Section 2 presents the model
for MTIP and MMTCP, as well as the two types of covers we consider. Section
3 introduces several approximation algorithms and heuristics for MMTCP. In
Section 4 we present efficient algorithms that find the optimal MMTCP solution
for the special case where the underlying network topology is a tree. Section 5
presents the results of simulation experiments on the vBNS network and randomly
generated networks. Last, Section 6 concludes the paper.

2 Model and Assumptions 

We represent a network by a directed graph $N = (V(N), E(N))$, where $V(N)$
and $E(N)$ denote the sets of nodes and links within $N$, respectively. When
unambiguous, we will omit the argument $N$. Our interest is in multicast trees
embedded within $N$. Let $S \subseteq V(N)$ be a set of potential multicast
senders, and $R \subseteq V(N)$ a set of potential multicast receivers. Let
$T = (V(T), E(T))$ denote a directed (multicast) tree with a source $s(T)$ and a
set of leaves $r(T)$. We require that $s(T) \in S$ and $r(T) \subseteq R$. Let $A$
be a mapping that takes a source $s \in S$ and receiver set $r \subseteq R$ and
returns a tree $A(s, r)$. In the context of a network, $A$ corresponds to the
multicast routing algorithm. Examples include DVMRP na, and PIM-DM and
PIM-SM 0. Let $\mathcal{T}(A,S,R) = \{A(s,r) : s \in S, r \subseteq R \setminus \{s\}\}$,
i.e., $\mathcal{T}(A,S,R)$ is the set of all possible multicast trees that can be
embedded in $N$ using multicast routing algorithm $A$. We shall henceforth denote
$\mathcal{T}(A,S,R)$ by $\mathcal{T}(S,R)$, omitting the dependence on $A$.
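To make the mapping A concrete, the following minimal sketch builds A(s, r) as the union of hop-count shortest paths from s to each receiver. This is an illustrative stand-in only, not how DVMRP or PIM actually compute trees; the function name `shortest_path_tree` and the representation of a tree as a set of directed links are assumptions made for the example.

```python
from collections import deque

def shortest_path_tree(links, s, receivers):
    """Hypothetical routing map A(s, r): union of BFS shortest paths from s."""
    adj = {}
    for u, v in links:
        adj.setdefault(u, []).append(v)
    # Breadth-first search from the source records one parent per reached node.
    parent = {s: None}
    queue = deque([s])
    while queue:
        u = queue.popleft()
        for v in adj.get(u, []):
            if v not in parent:
                parent[v] = u
                queue.append(v)
    # Walk back from every receiver to the source, collecting directed links.
    tree = set()
    for r in receivers:
        if r not in parent:
            raise ValueError(f"receiver {r!r} is unreachable from {s!r}")
        node = r
        while parent[node] is not None:
            tree.add((parent[node], node))
            node = parent[node]
    return tree
```

Calling the function once per sender and receiver subset would enumerate the tree set for this particular (assumed) routing rule.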

We associate the following cost with a multicast tree $T \in \mathcal{T}(S,R)$:

$$C(T) = C^0(T) + \sum_{l \in E(T)} C(l) \qquad (1)$$

where the first term is a “per tree cost” and the second is a “per link cost”. 
For example, IP multicast requires each multicast router to maintain per flow 
state. This is accounted for by the per tree cost. The per link cost is the cost for 
sending probe packets through a link. The two problems of interest to us are as 
follows: 



Multicast Tree Identifiability Problem. Given a set of multicast trees
$\Psi \subseteq \mathcal{T}(S,R)$ and a set of links $L \subseteq E$, is $L$
identifiable by the set of trees $\Psi$?

Minimum Cost Multicast Tree Cover Problem. Given $S, R \subseteq V$ and
$L \subseteq E$ such that $L$ is identifiable by $\mathcal{T}(S,R)$, what is the
minimum cost subset of $\mathcal{T}(S,R)$ sufficient to cover $L$? In other words,
find $\Psi \subseteq \mathcal{T}(S,R)$ that covers $L$ and minimizes

$$C(\Psi) = \sum_{T \in \Psi} C(T)$$
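The cost model can be sketched in a few lines. The example below assumes, purely for illustration, that a tree is a set of directed links, that every tree has the same per tree cost, and that link costs are given in a dictionary; the names `tree_cost` and `cover_cost` are hypothetical.

```python
def tree_cost(tree_links, per_tree_cost, link_cost):
    """Cost of one tree: per tree cost plus the sum of its per link costs."""
    return per_tree_cost + sum(link_cost[l] for l in tree_links)

def cover_cost(trees, per_tree_cost, link_cost):
    """Cost of a set of trees: the sum of the individual tree costs."""
    return sum(tree_cost(t, per_tree_cost, link_cost) for t in trees)
```

Note that a link shared by two trees is charged once per tree, since each tree sends its own probe packets across it.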

We distinguish two types of solutions to both of these problems. These
solutions differ in what exactly is meant by a cover. We say that a node $v$ is a
branch point in tree $T$ if $v$ is either a root or a leaf, or $v$ has more than
one child. A path $l = (v_1, v_2, \ldots, v_n)$ is said to be a logical link
within $T$ if $v_1$ and $v_n$ are branch points, $v_2, \ldots, v_{n-1}$ are not,
and $(v_i, v_{i+1}) \in E(T)$, $i = 1, \ldots, n - 1$.

— Strong cover: Given a set of trees $\Psi$, $\Psi$ is the strong cover of link
$l = (u, v)$ if there exists a $T \in \Psi$ such that both $u$ and $v$ are branch
points in $T$. $\Psi$ is the strong cover of a link set $L$ if $\forall l \in L$,
$\Psi$ is the strong cover of $l$.

— Weak cover: Given a set of trees $\Psi$, $\Psi$ is the weak cover of link $l$
if there exists a $T \in \Psi$ such that $l \in E(T)$. We say that $\Psi$ is the
weak cover of a link set $L$ if $\forall l \in L$, $\Psi$ is the weak cover of $l$.
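The two cover definitions can be checked mechanically. The sketch below assumes a tree is given as a set of directed (parent, child) links; the function names are hypothetical, and the strong test additionally requires the link to lie in the tree, which is an assumption of this example rather than part of the definition as stated.

```python
from collections import defaultdict

def branch_points(tree_links):
    """Return the root, the leaves, and every node with more than one child."""
    children = defaultdict(set)
    nodes, has_parent = set(), set()
    for u, v in tree_links:
        children[u].add(v)
        has_parent.add(v)
        nodes.update((u, v))
    return {
        n for n in nodes
        if n not in has_parent             # root
        or len(children.get(n, ())) != 1   # leaf (no child) or fork (>1 child)
    }

def weakly_covers(trees, link):
    """Weak cover: the link appears in at least one tree."""
    return any(link in t for t in trees)

def strongly_covers(trees, link):
    """Strong cover: in some tree, both endpoints of the link are branch points."""
    u, v = link
    for t in trees:
        bp = branch_points(t)
        if link in t and u in bp and v in bp:
            return True
    return False
```

For a path s, a, r1 the link (s, a) is only weakly covered, because a has a single child; adding a second receiver below a turns a into a branch point and the same link becomes strongly covered.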

We refer to the problems of finding these types of solutions as S-MTIP/S-MMTCP
and W-MTIP/W-MMTCP, respectively. Several cases are of interest to us. One is
where $L = E$, i.e., where the objective is to cover the entire network. A second
is where $L$ consists of one link, $|L| = 1$. If we set $C(l) = 0$ for all $l$,
the problem becomes that of covering the link set $L$ with the set of trees with
minimum total per tree cost.

The solutions to S-MTIP and W-MTIP are straightforward and are found
in p. Henceforth, $Cov(\Psi, COVER)$ is a function, described in P, that returns
the maximum set of links identifiable by the set of trees $\Psi$, where COVER can
be either ‘strong’ or ‘weak’.

In this paper, we assume that both the network topology and multicast rout- 
ing algorithm are given. Most debugging and monitoring is performed by net- 
work operators. They either know the exact topology or can easily discover it. 
They then can apply the network multicast routing algorithm to obtain the set 
of possible trees. End-users often have access to topology information and can 
apply tools such as mtrace (e.g., P) to identify the set of possible trees. Our 
model doesn’t account for temporary routing changes. However, according to 
measurements in HH, these occur very infrequently as compared to the dura- 
tion of a debugging session. Moreover, once a routing change is detected (e.g. 
by running mtrace periodically), one can always re-apply our techniques on the 
new topology and obtain a new layout of multicast trees. 

3 Approximating the Minimum Cost Multicast Tree 
Cover Problem 

We have defined two types of Multicast Tree Cover Problems: the S-MMTCP
and the W-MMTCP. Unfortunately, as the following theorem shows, not only
are these problems NP-Hard, we cannot even expect to find a good quality
approximation to these problems. In particular, we can demonstrate the following
(the proof is found in P):

Theorem 1. For each of S-MMTCP and W-MMTCP, it is NP-Hard to find
a solution that is within a factor of $(1 - \epsilon) \ln|L|$ of the optimal
solution, for any $\epsilon > 0$. These problems remain NP-Hard even with the
restriction that $C(l) = 0$ for all $l$.

Since we cannot expect to solve these problems exactly, in the remainder of
this section, we focus on approximation algorithms and heuristics for good
solutions. In Section 3.1 we focus on approximation algorithms for the case where
the goal is to minimize the total cost of setting up the multicast trees. We provide
polynomial time algorithms that have a provable bound on how close to optimal
the resulting solution is for the S-MMTCP and the W-MMTCP. In Section
3.2 we describe extensions to these algorithms for the problem of approximating
the general MMTCP. The resulting algorithms only run in polynomial time
when the number of possible receivers is $O(\log|E|)$, and thus, to approximate
the general MMTCP more efficiently, we also propose a heuristic that always
finds a solution to the general MMTCP in polynomial time. In Section 5 we
experimentally verify the quality of the solutions found by this heuristic.

3.1 Minimizing the Total Per-Tree Cost

We first study how to approximate the MMTCP when the goal is only to minimize
the total cost for setting up the multicast trees but not the cost for multicast
traffic to travel links, i.e., $C(l) = 0$ in (1). This problem is simpler, since
without a cost for link traffic, if a sender is performing a multicast, there is no
additional cost for sending to every receiver. Thus, we can assume that any active
sender multicasts to every possible receiver. Note, however, that by Theorem 1 even
this special case is NP-Hard to approximate within better than a $\ln|L|$ factor.
We describe algorithms for this problem that achieve exactly this approximation
ratio. These algorithms rely on the fact that when $C(l) = 0$ for all $l$, both the
W-MMTCP and the S-MMTCP can be solved using algorithms for the weighted
Set-Cover problem, which is defined as follows:

— The weighted Set-Cover problem: given a finite set $B = \{e_1, e_2, \ldots, e_m\}$
and a collection of subsets of $B$, $\mathcal{B} = \{B_1, B_2, \ldots, B_n\}$, where
each $B_i \subseteq B$ is associated with a weight $w_i$, find the minimum total
weight subset $\mathcal{B}' = \{B_{i_1}, \ldots, B_{i_k}\} \subseteq \mathcal{B}$
such that each $e_i \in B$ is contained in some $B_j \in \mathcal{B}'$.

To use algorithms for the weighted Set-Cover problem to solve the S-MMTCP
or W-MMTCP, we simply set $B = L$ and $B_i = \{l \in L : l$ is in the cover
(strong or weak, respectively) produced by $T_i\}$, where $T_i$ is the tree produced
by sender $i$ multicasting to every receiver. The weight of $B_i$ is the per-tree
cost of multicasting from sender $i$. Any solution to the resulting instance of the
weighted Set-Cover problem produces an S-MMTCP (W-MMTCP, resp.) solution
of the same cost. Using this idea, we introduce two algorithms for the MMTCP: 
a greedy algorithm modeled after a weighted Set-Cover algorithm analyzed by 
Chvatal j0|, and an algorithm that uses 0-1 integer programming, constructed 
using a weighted Set-Cover algorithm analyzed by Srinivasan m- 
Greedy Algorithm: The intuition behind the greedy algorithm is simple. As- 
sume first that the per-tree cost is the same for every multicast tree. In this case, 
the algorithm, at every step, chooses the multicast tree that covers the most re- 
maining uncovered links. This is repeated until the entire set of links is covered. 
When different trees have a different per-tree cost, then instead of maximizing 
the number of new links covered, the algorithm maximizes the number of new 
links covered, divided by the cost of the tree. Intuitively, this maximizes the
“profit per unit cost” of the new tree that is added.

The details of the algorithm are shown in Figure 1. This algorithm is easily
seen to run in polynomial time for all three types of covers. For the W-MMTCP
and S-MMTCP, Theorem 2 provides a bound on how good an approximation
the algorithm produces.
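The greedy rule can be sketched as a weighted set-cover loop. The example below is a simplification made for illustration: it assumes that the set of links each candidate tree covers has already been computed independently of the other trees, and that every cost is strictly positive. The names `greedy_cover`, `required` and `candidates` are hypothetical.

```python
def greedy_cover(required, candidates):
    """Pick trees by largest (newly covered required links) / cost.

    required:   set of links that must be covered
    candidates: dict mapping a tree name to (set of covered links, cost > 0)
    Returns the list of chosen tree names, in the order they were picked.
    """
    covered, chosen = set(), []
    while covered != required:
        best, best_gain = None, 0.0
        for name, (links, cost) in candidates.items():
            if name in chosen:
                continue
            new_links = (links & required) - covered
            gain = len(new_links) / cost
            if gain > best_gain:
                best, best_gain = name, gain
        if best is None:
            raise ValueError("the required links cannot be covered")
        chosen.append(best)
        covered |= candidates[best][0] & required
    return chosen
```

With equal costs the loop reduces to picking the tree that covers the most uncovered links; with unequal costs a cheap tree covering two links can beat an expensive tree covering three.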

Theorem 2. For any instance I of S-MMTCP or W-MMTCP with $C(l) = 0$ for all $l$,
the greedy algorithm finds a solution of cost at most (in d -\- | -|- • OPT, where 

d = min (|L|, max'Fj^ OPT is the cost of the optimal solution to I. 

We see from Theorems 1 and 2 that the performance of the greedy algorithm
is the best that we can hope to achieve. However, these theorems only apply
to the worst case performance; for the average case, the performance may be
much better, and the best algorithm may be something completely different. We
investigate this issue further by introducing a second approximation algorithm,
based on 0-1 integer programming. We shall see in Section 5 that the 0-1 integer
1. Compute $\Psi$ using the multicast routing protocol $A$, where $\Psi$ is the
set of all multicast trees from a source in $S$ to every receiver in $R$.
$\forall T_i \in \Psi$, set its cost $c_i = C^0(T_i)$.

2. Set $\Psi' = \Psi$, $\Phi = \emptyset$, $L' = \emptyset$.

3. If $L' = L$, then stop and output $\Phi$.

4. $\forall T_i \in \Psi' \setminus \Phi$, set $L_i = Cov(\Phi \cup \{T_i\}, COVER)$.
Find $T_i \in \Psi' \setminus \Phi$ that maximizes $|(L_i \cap L) \setminus L'| / c_i$.

5. $L' = L' \cup (L_i \cap L)$, $\Phi = \Phi \cup \{T_i\}$. Go to step 3.

Fig. 1. The Greedy Algorithm to Approximate MMTCP. 



programming algorithm performs better than the greedy algorithm in some cases. 
The details of this algorithm are omitted from this version of the paper due to 
space limitations and can be found in J here we only state the following theorem 
that demonstrates how good a solution is provided by this approach. The proof 
of the theorem is also presented in p. 

Theorem 3. For any instance I of either the S-MMTCP or the W-MMTCP,
the 0-1 linear programming algorithm finds a solution of cost at most
$OPT \cdot (1 + O(\max\{\ln(m/OPT), \sqrt{\ln(m/OPT)}\}))$, where OPT is the
optimal solution to I.

We also point out that in the Set-Cover problem, if we let $d = \max_i |B_i|$,
then even for $d = 3$, the Set-Cover problem is still NP-hard. However, it can be
solved in polynomial time provided that $d = 1$ or $d = 2$. Since we can transform
the S-MMTCP and the W-MMTCP to Set-Cover problems, we know that the
S-MMTCP and the W-MMTCP can be solved in polynomial time given that
$\max_i |L \cap E(T_i)| \le 2$, where $T_i$ is the tree produced by sender $i$
multicasting to every receiver.

3.2 Minimizing the Total Cost 

We next look at the general MMTCP, where the goal is to minimize the total 
cost. In addition to the per-tree cost of the multicast trees used in the cover, this 
includes the cost of multicast traffic traveling on the links used by the trees. Both 
the greedy algorithm and 0-1 integer programming algorithm can be extended to 
approximate the general problem. For the greedy algorithm, we simply replace 
the first step with: 

1. Compute $\Psi$ using the multicast routing protocol $A$, where $\Psi$ is the
set of all multicast trees from a source in $S$ to any subset of $R$.
$\forall T_i \in \Psi$, compute its cost $c_i = C(T_i)$.

The bound from Theorem 2 on the quality of approximation achieved by
the greedy algorithm also applies to this more general algorithm. However, the
greedy algorithm has a running time that is polynomial in $|\Psi|$. For the
algorithm of Section 3.1, $|\Psi| = |S|$, which results in a polynomial time
algorithm, but for the more general algorithm considered here,
$|\Psi| = |S| \cdot (2^{|R|} - 1)$. Thus, the more general approximation algorithm
only has a polynomial running time when $|R| = O(\log n)$, where $n$ is the size of
the input to the MMTCP problem (i.e., the description of $N$, $S$, $R$ and $L$).
The analogous facts also apply to the 0-1 integer programming algorithms.
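As a rough illustration of this growth, count one candidate tree per sender and non-empty receiver subset, and take eight senders with either eight or sixteen receivers (the sizes used in the experiments):

$$8 \cdot (2^{8} - 1) = 8 \cdot 255 = 2040, \qquad 8 \cdot (2^{16} - 1) = 8 \cdot 65535 = 524280.$$

Doubling the number of receivers therefore multiplies the candidate set by roughly 257, which is why exhaustive enumeration is only practical for small receiver sets.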

In order to cope with large values of $|R|$ in the general MMTCP, we also
introduce the fast greedy heuristic, which always runs in polynomial time. Fast
greedy is like the greedy algorithm, except that instead of considering all possible
multicast trees (i.e., every tree from a sender to a subset of the receivers), it
restricts itself to only those trees that contain 3 receivers (or, in the case of
a weak cover, 1 receiver). There will be at most a polynomial number of such
trees. Fast greedy then uses the greedy strategy to choose a subset of these trees
covering all required links, and then merges the trees with the same sender. The
details of this heuristic are described in Figure 2. We shall see in Section 5 that
the performance of the fast greedy heuristic is often close to that of the greedy
algorithm.
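The restricted candidate set used by fast greedy is easy to enumerate. The sketch below only lists the (sender, receiver group) pairs; turning each pair into an actual tree would require the routing algorithm, which is not modeled here. The function name and the string values for the cover type are assumptions made for the example.

```python
from itertools import combinations

def fast_greedy_candidates(senders, receivers, cover):
    """List (sender, receiver group) pairs: groups of 3 for 'strong', 1 for 'weak'."""
    if cover not in ("strong", "weak"):
        raise ValueError("cover must be 'strong' or 'weak'")
    group_size = 3 if cover == "strong" else 1
    pairs = []
    for s in sorted(senders):
        # A sender never counts as one of its own receivers.
        others = sorted(r for r in receivers if r != s)
        for group in combinations(others, group_size):
            pairs.append((s, frozenset(group)))
    return pairs
```

For senders disjoint from the receivers this yields $|S| \cdot \binom{|R|}{3}$ pairs in the strong case and $|S| \cdot |R|$ in the weak case, both polynomial in the input size.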



1. If COVER = ‘strong’, apply the multicast routing protocol $A$ to
compute $\Psi$, the set of all multicast trees that have one sender in $S$ and
three receivers in $R$.
If COVER = ‘weak’, apply the multicast routing protocol $A$ to compute $\Psi$,
the set of all multicast trees (paths) that have one sender in $S$ and one
receiver in $R$.
$\forall T_i \in \Psi$, compute its cost $c_i = C(T_i)$.

2. For all $T_i \in \Psi$, set $E_i = E(T_i)$.

3. Set $\Psi' = \Psi$, $\Phi = \emptyset$, $L' = \emptyset$.

4. If $L' = L$, then aggregate all the trees in $\Phi$ that share the same source
node and output $\Phi$. Stop.

5. For all $T_i \in \Psi' \setminus \Phi$, set $\Psi_i = \Phi \cup \{T_i\}$ and
aggregate all the trees in $\Psi_i$ sharing a common source as one tree. Set
$L_i = Cov(\Psi_i, COVER)$.
Find $T_i \in \Psi' \setminus \Phi$ that maximizes $|(L_i \cap L) \setminus L'| / c_i$.

6. $L' = L' \cup (L_i \cap L)$, $\Phi = \Phi \cup \{T_i\}$. For each
$T_j \in \Psi' \setminus \Phi$, if $T_j$ shares the same source with $T_i$, set
$E_j = E_j \setminus E_i$ and update $c_j$ accordingly. Go to step 4.



Fig. 2. Fast Greedy Heuristic to Approximate MMTCP. 
4 Finding the Optimal Solution in a Tree Topology

We saw in Theorem 1 that we cannot hope to find an efficient algorithm that
solves any version of the MMTCP in general. However, this does not rule out
the possibility that it is possible to solve these problems efficiently on certain
classes of network topologies. In this section, we study the MMTCP in the case
that the underlying network topology $N$ is a tree. This is motivated by the
hierarchical structure of real networks, which can be thought of as having a
tree topology with a small number of extra edges. We shall see in Section 5
that algorithms for the tree topology can be adapted to provide a very effective
heuristic for such hierarchical topologies. We use this heuristic to provide good
solutions to the W-MMTCP problem for the topologies of the vBNS network,
as well as the Abilene network.

Our algorithm for the tree topology is guaranteed to find the optimal solution
in polynomial time. We shall describe this algorithm for the (easier) case of
the W-MMTCP. In order to describe this algorithm more concisely, we shall
make some simplifying assumptions. In particular, we assume that $N$ is a rooted
binary tree, that the per tree cost of every multicast tree is zero, that
$C((a, b)) = C((b, a))$ and that the cover requirement on a link can be satisfied
from either direction. For the W-MMTCP, these assumptions can be removed by making
the algorithm slightly more complicated, but without significantly increasing
the running time of the algorithm. For the S-MMTCP, all of the assumptions
can be removed except for the assumption that $N$ is a binary tree.

The algorithm uses the technique of dynamic programming, and starts by
creating a table, with one row in the table for each link $l$ of the tree, and
$|S|^2$ entries in each row, labeled $C_{\langle l,i,j \rangle}$ for
$0 \le i, j \le |S|$. For link $l$ connecting nodes $u, v \in V$, where $u$ is
closer to the root, let $ST_l$ be the subtree rooted at node $v$, together with
link $l$ and node $u$. The value computed for entry $C_{\langle l,i,j \rangle}$
is the minimum possible total cost for the tree $ST_l$ (removed from the rest of
the network), subject to the following conditions: all of the links that are
required to be covered in $N$ are covered in $ST_l$, $u$ is a source that
generates $j$ multicast sessions that are routed across $l$, and $u$ is also a
receiver that receives $i$ multicast sessions. If there are fewer than $i$ senders
in $ST_l - u$, or $j > 0$ and there are no receivers in $ST_l - u$, then we call
the pair $(i, j)$ invalid for link $l$, and the value of entry
$C_{\langle l,i,j \rangle}$ is set to infinity.

We compute the values in the table one row at a time, in decreasing order
of the distance of the corresponding links from the root. When $l$ is connected to
a leaf of the tree $N$, it is straightforward to compute $C_{\langle l,i,j \rangle}$
for all $i, j$, since if $(i, j)$ is valid, then
$C_{\langle l,i,j \rangle} = (i + j) C(l)$. We now show how to compute the
remaining entries $C_{\langle l,i,j \rangle}$ for a link $l$ connecting nodes $u$
and $v$, where $u$ is closer to the root, and $v$ is connected to two links $m$
and $n$, as depicted in Figure 3. Since $m$ and $n$ are further from the root than
$l$, we can assume that $C_{\langle m,i_m,j_m \rangle}$ and
$C_{\langle n,i_n,j_n \rangle}$ have already been computed, for
$0 \le i_m, j_m, i_n, j_n \le |S|$.

We see that
$C_{\langle l,i,j \rangle} = \min(C_{\langle m,i_m,j_m \rangle} + C_{\langle n,i_n,j_n \rangle}) + (i + j) C(l)$,
where the minimum is taken over all $i_m, j_m, i_n, j_n$ that provide valid
multicast flows through the node $v$. Which values of the flows are valid is
checked using a reasonably simple algorithm. Due to lack of space, this algorithm
has been deferred to Q. Along with the value $C_{\langle l,i,j \rangle}$, we store
the values of the $i_m, j_m, i_n$ and $j_n$ that resulted in the minimum
$C_{\langle l,i,j \rangle}$. Call these values the optimal indices for
$C_{\langle l,i,j \rangle}$. If $(i, j)$ is invalid for link $l$,
$C_{\langle l,i,j \rangle}$ is set to infinity. Also, if link $l$ is in the
to-be-covered set of links, $C_{\langle l,0,0 \rangle}$ is set to $\infty$. By
proceeding in this fashion from the leaves to the root, we see that we can fill in
the entire table.

To complete the algorithm, we attach a virtual link $x$ to the root of $N$ with
$C(x) = 0$, and use the same technique to compute $C_{\langle x,0,0 \rangle}$. The
minimum cost for covering the given set of required links in $N$ is
$C_{\langle x,0,0 \rangle}$. In order to find the actual multicast trees, we first
follow the stored optimal indices from $C_{\langle x,0,0 \rangle}$ to the leaves of
the tree to determine the actual optimal number of flows in either direction on
each link of the tree. Given this information, a simple greedy algorithm finds a
set of multicast trees that results in this number of flows. The description of
this greedy algorithm is left to the full version of the paper. We present the
details of our algorithm, which we call Tree-Optimal, in Figure 4.

Theorem 4. The algorithm Tree-Optimal finds the optimal cost of a solution
to the W-MMTCP in any binary tree in $O(|E| |S|^6)$ steps.

The proof is provided in P|. Following a similar idea, we can also construct 
an algorithm for solving the S-MMTCP. However, the strong cover requirements 
make that algorithm somewhat more complicated than the algorithm presented 
here. 

In order to deal with the case that N is a tree of arbitrary degree (instead 
of just a binary tree), we transform N into a binary tree N' . To do so, choose 
any node to be the root of the tree N. Then, given a node r with n children, 
create a virtual node r', make it the child of node r and assign zero cost to the 
virtual link (r, r'). Then pick any n — 1 other links attached to node r, and move 
them to node r'. We repeat this process until all nodes are binary; the resulting
tree is N'. Since the cost of traveling any virtual link is zero, and no virtual 
link is in the set of to be covered links, the cost of an optimal solution for the 
topology N is the same as the optimal cost for the topology N'. Thus, we can 
find the optimal solution for N by transforming N into the topology N', finding 
an optimal solution for the topology N' , and then transforming the solution for 
N' into an equal cost solution for N. 
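The binarization step can be sketched as follows. The example assumes a rooted tree stored as a dictionary from each node to its list of children, with one cost per (parent, child) link, and it names virtual nodes with tuples of the form ("virtual", k); all of these representation choices are assumptions made for illustration.

```python
def binarize(children, link_cost):
    """Split every node with more than two children using zero-cost virtual links."""
    new_children = {node: list(kids) for node, kids in children.items()}
    new_cost = dict(link_cost)
    pending = list(new_children)
    counter = 0
    while pending:
        r = pending.pop()
        kids = new_children.get(r, [])
        if len(kids) <= 2:
            continue
        counter += 1
        virtual = ("virtual", counter)
        moved = kids[1:]                    # move all but one child link to the virtual node
        new_children[r] = [kids[0], virtual]
        new_children[virtual] = moved
        new_cost[(r, virtual)] = 0.0        # virtual links are free to traverse
        for child in moved:
            new_cost[(virtual, child)] = new_cost.pop((r, child))
        pending.append(virtual)             # the virtual node may itself need splitting
    return new_children, new_cost
```

Because every virtual link costs zero and every original link keeps its cost, the total link cost of the tree is unchanged by the transformation.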

We also can deal with the general problem where each multicast tree has a 
per tree cost. We do this by adding a virtual node u and virtual link (u, v) for
each source candidate v, and replacing the source candidate v with the virtual 
node u. We then assign the cost for initializing the multicast tree that was rooted 
at node v as the cost of the link (u, v). The problem then becomes that of finding 
the optimal cost covering without per tree costs in the resulting network. If we 
also want to allow the cost of initializing different multicast trees at the same 
source to vary, we can attach a different virtual node u for each multicast tree 
rooted at v. 

5 Experiments and Findings 

In this section, we explore the effectiveness of the heuristics presented in the 
previous two sections in the context of the Internet2 vBNS backbone network. 
Fig. 3. A Link l Incident to the Same Node as Links m and n.

1. Add a virtual link $x$ whose downstream node is the root of the tree and
assign zero cost for traveling link $x$.

2. Create a table $C$, with each row labeled with a link $l$, and each column
labeled by $\langle i,j \rangle$, for $0 \le i, j \le |S|$. Initialize every entry
in the table to $+\infty$.

3. For each link $l$, let $SU_l$ and $SD_l$ be the number of sources above and
below $l$ respectively. Let $RU_l$ and $RD_l$ be the number of receivers above
and below $l$ respectively.

4. $\forall l_i$ such that $l_i$ is a link attached to a leaf node:
if ($SD_{l_i} == 1$) then $C_{\langle l_i,0,1 \rangle} = C(l_i)$
if ($RD_{l_i} == 1$) then
$C_{\langle l_i,j,k \rangle} = C(l_i) * (j + k)$, $\forall j, k$, $0 \le j \le SU_{l_i}$, $0 \le k \le 1$. endif
else
if ($RD_{l_i} == 1$) then $C_{\langle l_i,j,0 \rangle} = C(l_i) * j$, $\forall j$, $0 \le j \le SU_{l_i}$ endif
if ($l_i \in L$) then $C_{\langle l_i,0,0 \rangle} = +\infty$ else $C_{\langle l_i,0,0 \rangle} = 0$ endif

5. Choose any link $l_i$, such that row $l_i$ has not been computed, and $l_i$ is
incident to a vertex that is also incident to links $l_j$ and $l_k$ such that
rows $l_j$ and $l_k$ have been computed. Let $v$ be the node they share.
$C_{\langle l_i,p_3,q_3 \rangle} = \min(C_{\langle l_j,p_2,q_2 \rangle} + C_{\langle l_k,p_1,q_1 \rangle} + C(l_i) * (p_3 + q_3))$,
where $valid(p_3, q_3, p_2, q_2, p_1, q_1, v)$ is true.
if ($l_i \in L$) then $C_{\langle l_i,0,0 \rangle} = +\infty$ endif

6. If all the links are done, then stop and return $C_{\langle x,0,0 \rangle}$.
Else go to step 5.

Fig. 4. The Algorithm Tree-Optimal. This algorithm finds the cost of the
optimal solution to the W-MMTCP on a binary tree topology. The function
valid determines whether or not the integers $p_3, q_3, p_2, q_2, p_1$, and $q_1$
define valid sets of flows at the vertex $v$. Due to lack of space, this algorithm
has been deferred to p.



As this network is sparse, we also consider a set of denser, randomly generated 
networks. 
5.1 The vBNS Network

We consider the vBNS Internet2 backbone network m- It maintains native IP 
multicast services using the PIM sparse-mode routing algorithm. The vBNS mul- 
ticast logical topology (as of October 25, 1999) is illustrated in It consists of 
160 nodes and 165 edges. The vBNS infrastructure was built out of ATM. Each
node represents a physical location and each link represents a physical
interconnection between two routers at different locations. The link bandwidths
vary between 45 Mb/s (DS3) and 2.45 Gb/s (OC48). Since the more detailed topology
within each physical location is not available to us, we treat each node as a router
and focus on the logical topology in our experiments. In addition, we assume
that the cost of using a link for measurement within one multicast session is
inversely proportional to its bandwidth. Last, we assume that only the leaves in
the topology (i.e., nodes of degree one) can be a sender or a receiver.
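The link cost assumption can be written down directly. In the sketch below the proportionality constant `scale` and the function name are hypothetical, and bandwidths are taken to be in Mb/s, so the two extreme link types would be entered as 45 and 2450.

```python
def inverse_bandwidth_costs(bandwidth_mbps, scale=1.0):
    """Assign each link a cost inversely proportional to its bandwidth."""
    return {link: scale / bw for link, bw in bandwidth_mbps.items()}
```

Under this rule only cost ratios matter: a DS3 link is about 54 times as expensive to probe as an OC48 link, whatever value is chosen for the constant.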



Tree Heuristic. In Section 4 we proposed the algorithm Tree-Optimal that
is guaranteed to find optimal solutions in polynomial time for any tree topology.
We propose and study a heuristic based on that algorithm. This heuristic uses
the observation that the topology of a network such as vBNS is very close to a
tree. Furthermore, the bandwidth of the small number of links that create cycles
tends to be high, and thus these links presumably have low cost.

The heuristic can be applied to any topology, and proceeds in four phases. 
In the first phase, the topology is converted to a tree by condensing every cycle 
into a single super-node. In the second phase, algorithm Tree-Optimal is run 
on the resulting tree. This yields a set of multicast trees, defined by a list of 
senders, and for each sender a list of receivers. In the third phase, this set of 
multicast trees is mapped back to the original topology, so that the same set of 
senders each send to their respective receivers. This is guaranteed to cover all 
of the required links that were not condensed into super-nodes, but may leave 
required links that appear in cycles uncovered. The fourth and final phase uses 
the fast greedy heuristic to cover any such edge. 
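The first phase can be sketched with a standard bridge-finding pass. The example below reads "condensing every cycle" as contracting each maximal group of nodes joined by cycle links (a 2-edge-connected component) into one super-node, which is an interpretation made for this illustration. It also assumes an undirected edge list without parallel edges and a graph small enough for recursive depth-first search; the function name is hypothetical.

```python
from collections import defaultdict

def condense_cycles(edges):
    """Return (super-node of each node, list of tree edges between super-nodes)."""
    edges = list(edges)
    adj = defaultdict(list)
    for eid, (u, v) in enumerate(edges):
        adj[u].append((v, eid))
        adj[v].append((u, eid))
    nodes = list(adj)

    # Phase A: find bridges (links on no cycle) with a low-link depth-first search.
    disc, low, bridges, clock = {}, {}, set(), [0]

    def dfs(u, parent_edge):
        disc[u] = low[u] = clock[0]
        clock[0] += 1
        for v, eid in adj[u]:
            if eid == parent_edge:
                continue
            if v in disc:
                low[u] = min(low[u], disc[v])
            else:
                dfs(v, eid)
                low[u] = min(low[u], low[v])
                if low[v] > disc[u]:
                    bridges.add(eid)

    for n in nodes:
        if n not in disc:
            dfs(n, None)

    # Phase B: nodes joined by non-bridge links collapse into one super-node.
    super_node = {}
    for start in nodes:
        if start in super_node:
            continue
        super_node[start] = start
        stack = [start]
        while stack:
            u = stack.pop()
            for v, eid in adj[u]:
                if eid not in bridges and v not in super_node:
                    super_node[v] = start
                    stack.append(v)

    tree_edges = [(super_node[u], super_node[v])
                  for eid, (u, v) in enumerate(edges) if eid in bridges]
    return super_node, tree_edges
```

The bridges that survive are exactly the links that the tree algorithm can cover directly; every other link is a cycle link and is left to the final phase.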

Note that the cost of the solution obtained by Tree-Optimal is a lower 
bound on the cost of the solution to the actual topology. This implies several 
important properties of this heuristic. Call any link that appears on a cycle in 
the graph a cycle link. If all of the cycle links have zero cost, and no cycle link 
must be covered, then the tree heuristic is guaranteed to produce an optimal 
solution. Also, if there are not many cycle links, or they all have relatively small 
cost, then the solution found by the heuristic will usually be close to the lower 
bound, and thus close to optimal. The fact that the heuristic produces this lower 
bound is also useful, as it allows one to estimate how close to optimal the solution 
produced by the heuristic is. 



Effectiveness of Heuristics. In Section 3.1 we introduced the greedy algorithm
and the 0-1 integer programming algorithm for approximating S-MMTCP and
W-MMTCP, and described their worst case approximation ratio bounds. In order
to approximate the general MMTCP in polynomial time, we also proposed a fast
greedy heuristic in Section 3.2. In this section we study the average performance
of these algorithms and heuristics through experiments on the Internet2 backbone
networks. Since the topologies of these networks are close to tree topologies,
we include the performance of the tree heuristic on Internet2 backbones in our
study.

To create a suite of problem instances, we varied the sizes of the source and 
receiver candidate sets. In addition, for a particular pair of source candidate 
set and receiver candidate set, we chose the size of the set of links that must 
be covered to be proportional to the size of the set of links that the source 
candidate set and the receiver candidate set can identify. For each problem size, 
we generated 100 random problem instances for the vBNS multicast network. 
For each of these problem instances, we determined the cost of the solution found 
by each algorithm. We assumed that all the multicast trees have the same fixed 
initialization cost. 

We ran the algorithms on inputs where the number of source candidates is
eight and the number of receiver candidates varies from eight to sixteen. For
problem instances of this size, the optimal solutions can be computed using
exhaustive search, and this can be used to check the quality of the approximation
results. For both W-MMTCP and S-MMTCP, we used the 0-1 integer programming,
greedy and fast greedy algorithms to approximate the 100 problem instances on
vBNS for each of the problem sizes. In addition, we used the tree heuristic to
approximate the W-MMTCP. We compare the performance of the algorithms for
S-MMTCP in Figure 5 and W-MMTCP in Figure 6. In both figures, the ratio of the
solutions found by the approximation algorithm to the optimal solutions is
plotted. For each approximation algorithm, we sort the ratios in ascending order.
Thus, for example, problem instance 1 for each algorithm represents the instance
where that algorithm performed the closest to optimal, and may correspond to
different inputs for the different algorithms. We present plots for two different
problem sizes: 8 sources and 8 receivers, as well as 8 sources and 16 receivers.
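The plotted quantity can be computed as in the following sketch, which assumes two equally long lists of costs for the same problem instances; the function name is hypothetical.

```python
def sorted_ratios(approx_costs, optimal_costs):
    """Ratio of approximate to optimal cost per instance, sorted ascending."""
    return sorted(a / o for a, o in zip(approx_costs, optimal_costs))
```

Because each algorithm's ratios are sorted separately, position k on the horizontal axis is that algorithm's k-th best instance, not a fixed input shared by all curves.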

In Figure 5, it is surprising to see that the fast greedy algorithm produces
the same solution as the greedy algorithm and that the 0-1 integer programming 
algorithm yields the optimal solution on most inputs when the problem size is 
small. As the problem size increases, the 0-1 integer programming is less likely 
to produce the optimal solution and the difference between the fast greedy and 
the optimal seems to increase slowly. 

In the case of approximating W-MMTCP, the tree heuristic outperforms
both the greedy algorithm and the fast greedy algorithm on most problem in- 
stances. The quality of the 0-1 integer programming algorithm decreases as the 
problem size increases. The results from the fast greedy are only slightly worse 
than those from the greedy algorithm. The difference between the fast greedy 
algorithm and greedy algorithm seems to change very slowly as the problem size 
increases. 

We have also conducted a similar study with similar results on the Internet2 
Abilene backbone network EH]. Details of this study are found in [T| 
Fig. 5. Comparison of Approximation Algorithms for the S-MMTCP on vBNS.
Panels: 8 Sources and 8 Receivers; 8 Sources and 16 Receivers.

Fig. 6. Comparison of Approximation Algorithms for the W-MMTCP on vBNS.



5.2 Experiments on Dense Networks 

vBNS is quite sparse, i.e., it only contains a very small number of additional 
edges beyond that of a tree topology containing the same number of nodes. 
In this section, we investigate how our algorithms perform on denser networks
than vBNS. Unfortunately, we had no such multicast topologies available to us.
Instead, we make use of randomly generated topologies.

We generated ten 100-node transit-stub undirected graphs using GT-ITM 
(Georgia Tech internetwork topology model). For more details about the transit- 
stub network model, please refer to [Ej. The average out-degree is in the range 
of [2.5,5]. We assigned two costs to each edge in the graphs, one for each
direction. These costs are uniformly distributed in [2,20]. By randomly picking the
source candidate set, receiver candidate set and to-be-covered set of links and
then assigning costs to the edges, we generated ten problem instances for each 
graph. We ran all algorithms on a total of 100 problem instances and compared 
their performance. We assumed that all the multicast trees have the same fixed 
initialization cost. 

Fig. 7. Comparison of Approximation Algorithms for the S-MMTCP on 100-node 
transit-stub. Panels: 8 Sources and 8 Receivers; 8 Sources and 16 Receivers. 

Fig. 8. Comparison of Approximation Algorithms for the W-MMTCP on 100-node 
transit-stub. Panels: 8 Sources and 8 Receivers; 8 Sources and 16 Receivers. 

We ran the algorithms on inputs where the number of source candidates is 
eight and the number of receiver candidates varies from eight to sixteen. We used 
the 0-1 integer programming, greedy and fast greedy algorithms to approximate 
the 100 problem instances for each of the problem sizes. We compare the 
performance of the algorithms for the S-MMTCP in Figure 7 and the W-MMTCP in 
Figure 8. In both figures, the ratio of the solution found by the approximation 
algorithms to the optimal solution is plotted. For each approximation algorithm, 
we sorted the ratios in ascending order. 
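
The plotted quantity, the ratio of each approximate solution to the optimal one, sorted per algorithm, can be computed along the following lines. This is a minimal sketch: the function names are hypothetical and the cost values in the example are made up for illustration; they are not results from the experiments.

```python
import math


def sorted_ratios(approx_costs, optimal_costs):
    """Return approx/optimal ratios in ascending order.

    Both sequences must list the same problem instances in the same order.
    """
    if len(approx_costs) != len(optimal_costs):
        raise ValueError("need one optimal cost per approximate cost")
    return sorted(a / o for a, o in zip(approx_costs, optimal_costs))


def fraction_optimal(ratios, tol=1e-9):
    """Fraction of instances on which the approximation matched the optimum."""
    hits = sum(1 for r in ratios if math.isclose(r, 1.0, abs_tol=tol))
    return hits / len(ratios)


# Made-up costs for three instances, for illustration only.
example_ratios = sorted_ratios([10.0, 12.0, 30.0], [10.0, 10.0, 20.0])
```

Plotting `example_ratios` against the instance rank gives one curve of the kind shown in the figures, with a ratio of 1 meaning the approximation found an optimal solution.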

In Figure 7, it is surprising to see that, on most inputs, the fast greedy 
algorithm produces the same solution as the greedy algorithm and the 0-1 integer 
programming algorithm yields the optimal solution. As the problem size 
increases, the greedy and fast greedy algorithms are less likely to produce the 
optimal solution. 

In the case of approximating the W-MMTCP, the 0-1 integer programming 
algorithm yields the optimal solution on about half the problem instances. 
However, it yields worse results than the greedy and fast greedy algorithms for 
about forty percent of the problem instances. The results from the fast greedy 
algorithm are slightly worse than those from the greedy algorithm in most cases. 
The difference between the fast greedy algorithm and the greedy algorithm seems 
to increase very slowly as the problem size increases. 

6 Conclusions 

In this paper we focussed on the problem of selecting trees from a candidate 
set in order to cover a set of links of interest. We identified three variations 
of this problem according to the definition of cover and addressed two questions 
for each of them: 

— Is it possible to cover the links of interest using trees from the candidate set? 

— If the answer to the first question is yes, what is the minimum set of trees 
that can cover the links? 

We proposed computationally efficient algorithms for the first of these questions. 
We also established, with some exceptions, that determining the minimum-cost 
set of trees is a hard problem. Moreover, it is a hard problem even to develop 
approximate solutions. One exception is when the underlying topology is a tree, 
in which case we present efficient dynamic programming algorithms for two of 
the covers. We also proposed several heuristics and showed through simulation 
that a greedy heuristic that combines trees with three or fewer receivers performs 
reasonably well. 
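
The two questions above can be illustrated with a small sketch. It is not the heuristic evaluated in this paper: it uses the textbook greedy rule for weighted set cover, and it assumes the simplest notion of cover, namely that a link is covered as soon as it appears in at least one selected tree. The function names and the toy trees, links, and costs are hypothetical.

```python
def can_cover(trees, links):
    """First question: do the candidate trees jointly contain every link?

    `trees` maps a tree name to the set of links that tree traverses.
    """
    reachable = set().union(*trees.values())
    return set(links) <= reachable


def greedy_cover(trees, costs, links):
    """Second question, approximately: repeatedly select the tree with the
    lowest cost per newly covered link. Returns the chosen tree names in
    selection order, or None if the links cannot be covered."""
    uncovered = set(links)
    chosen = []
    while uncovered:
        best, best_score = None, None
        for name, tree_links in trees.items():
            gain = len(tree_links & uncovered)
            if gain == 0:
                continue  # this tree adds nothing new
            score = costs[name] / gain
            if best is None or score < best_score:
                best, best_score = name, score
        if best is None:
            return None  # no remaining tree covers any uncovered link
        chosen.append(best)
        uncovered -= trees[best]
    return chosen


# Toy candidate set: three trees over links a, b, c (illustrative values).
toy_trees = {"t1": {"a", "b"}, "t2": {"b", "c"}, "t3": {"a", "b", "c"}}
toy_costs = {"t1": 2.0, "t2": 2.0, "t3": 3.5}
selection = greedy_cover(toy_trees, toy_costs, {"a", "b", "c"})
```

On this toy input the greedy rule selects `t1` and then `t2` for a total cost of 4.0, although `t3` alone covers all three links at cost 3.5. This shows why such heuristics are evaluated against the optimal solution rather than assumed to reach it.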